(57) ABSTRACT

A method for scheduling transmission of cells through a data 
switch having a plurality of inputs and outputs provides a 
plurality of buffers at each input. Each buffer corresponds to 
an output, or to a virtual circuit. A weight is assigned to each 
buffer; and buffers are selected according to a maximal 
weighted matching. Finally, cells are transmitted from the 
selected buffers to the corresponding outputs. Weights are 
based on number of credits associated with each buffer. 
Optionally, the weight is zero if the associated buffer is 
empty. A credit bucket size may be assigned to each buffer 
to limit the number of credits when the buffer is empty. 
Alternatively, weights are set to either buffer length, or to the 
number of credits, whichever is less. Or, weights may be set 
to validated waiting times associated with the oldest cells. 
Each input/output pair is assigned the maximum weight of 
any associated virtual connection. Fairness is provided in 
leftover bandwidth by determining a second matching 
between remaining inputs and outputs. Buffers are selected 
according to the second matching. In addition, a linked list 
structure is provided. Each list is associated with a weight, 
and holds references to buffers which have that weight, and 
has links to next and previous lists associated respectively 
with weights one greater and one less than the subject list's 
associated weight. Each reference is placed in a list associ- 
ated with the respective weight. Upon changing a buffer's 
weight, its reference is moved to the list corresponding to the 
new weight. Previously unselected buffers are selected from
the lists in order of descending weights.

58 Claims, 13 Drawing Sheets 
METHOD FOR SCHEDULING
TRANSMISSIONS IN A BUFFERED SWITCH 

RELATED APPLICATION 

This application claims the benefit of U.S. Provisional 
Application No. 60/061,347, filed Oct. 8, 1997, the entire 
teachings of which are incorporated herein by reference. 

BACKGROUND OF THE INVENTION 

Switches and routers have traditionally employed output- 
queuing. When packets or cells arrive at an input port, they 
are immediately transferred by a high-speed switching fabric 
to the correct output port and stored in output queues. 
Various queue management policies which have been 
considered, such as virtual clock algorithms, deficit round 
robin, weighted fair queuing or generalized processor 
sharing, and many variations, have attempted to control 
precisely the time of departure of packets belonging to 
different virtual circuits (VCs) or flows or sessions, thus 
providing various quality-of-service (QoS) features such as
delay, bandwidth and fairness guarantees. 

However, for these pure output-queuing schemes to work, 
the speed of the switching fabric and output buffer memory 
must be N times faster than the input line speed where N is 
the number of input lines, or the sum of the line speeds if 
they are not equal. This is because all input lines could have 
incoming data arriving at the same time, all needing to be 
transferred to the same output port. As line speeds increase 
to the Gb/s range and as routers have more input ports, the 
required fabric speed becomes infeasible unless very expen- 
sive technologies are used. 

To overcome this problem, switches with input-queuing 
have been used in which incoming data are first stored in 
queues at the input ports. The decision of which packets to 
transfer across the fabric is made by a scheduling algorithm. 
A relatively slower fabric transfers some of the packets or 
cells to the output ports, where they might be transmitted 
immediately, or queued again for further resource manage- 
ment. The present invention only considers the problem 
from the viewpoint of designing a fabric fast enough to 
manage input queues, regardless of whether there are also 
output queues. 

The ratio of the fabric speed to the input speed is called 
the "speedup." An output-queued switch essentially has a 
speedup of N (whereupon input queues become 
unnecessary), whereas an input-queued switch typically has 
a much lower speedup, as low as the minimum value of one, 
i.e., no speedup. The main advantage of input queuing with 
low speedup is that the slower fabric speed makes such a 
switch more feasible and scalable, in terms of current 
technology and cost. The main disadvantage is that packets 
are temporarily delayed in the input queues, especially by 
other packets in the same queues destined for different 
outputs. In contrast, with output-queuing a packet is never 
affected by other packets destined for different outputs. This 
additional input-side queuing delay must be understood or
quantified in order for an input-queued switch to provide
QoS guarantees comparable to those of an output-queued switch.

One problem with input-queued switches is that if the 
next cell to be transmitted — that is, the cell at the head of the 
queue — is blocked because its destination port is busy, or 
perhaps because it has a low priority, all other cells queued 
up behind it are also blocked. This is known as head-of-line 
blocking. This problem is commonly resolved by allowing 
per-output queuing, in which each input has not one but M 
queues corresponding to M outputs. Thus the unavailability 
of one output does not affect the scheduling of cells bound
for other outputs. 

Graph theory concepts have been used to develop algorithms
in attempts to efficiently select input/output pairs for
transmission across the switch fabric. Inputs are treated as
one set of nodes, outputs as the second set of nodes, and the
paths between input/output pairs having data to transmit are
treated as the edges of the graph. A subset of edges such that
each node is associated with at most one edge is called a
matching.
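
As a rough illustration of the matching definition above, the following Python sketch checks whether a set of (input, output) edges touches every node at most once. The function name `is_matching` and the tuple representation of edges are assumptions made for this example only.

```python
def is_matching(edges):
    """Return True if no input and no output appears in more than one edge.

    `edges` is an iterable of (input, output) pairs; this representation
    is an assumption of the sketch, not something fixed by the text.
    """
    edges = list(edges)
    inputs = [i for i, _ in edges]
    outputs = [j for _, j in edges]
    # A node used twice would make the list longer than its set.
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)
```

For example, `[(0, 1), (1, 0)]` is a matching, while `[(0, 1), (0, 2)]` is not, because input 0 would have to transmit to two outputs in the same timeslot.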

L. Tassiulas, A. Ephremides, "Stability properties of
constrained queuing systems and scheduling policies for
maximum throughput in multihop radio networks," IEEE Trans.
Automatic Control, vol. 37, no. 12, December 1992,
pp. 1936-1948, presented a scheduling algorithm using
queue lengths as edge weights and choosing a matching with
the maximum total weight at each timeslot. The expected
queue lengths are bounded, i.e., they do not exceed some
bound, assuming of course that no input or output port is
overbooked. This holds even if the traffic pattern is
non-uniform, and even if any or all ports are loaded
arbitrarily close to 100%. Hence, this "maximum weighted
matching" algorithm, using queue lengths as weights,
achieves 100% throughput. For an overview of the maximum
weighted matching problem, see e.g., Ahuja, et al.,
Network flows: theory, algorithms, and applications.
Published: Englewood Cliffs, N.J., Prentice Hall, 1993.
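
To make the idea of a maximum weighted matching with queue lengths as weights concrete, the following Python sketch finds one by exhaustive search. It is only an illustration for very small switches (it tries all N! assignments) and is not the algorithm of the cited paper; the name `max_weight_matching` and the matrix layout `queue_len[i][j]` are assumptions.

```python
from itertools import permutations


def max_weight_matching(queue_len):
    """Brute-force maximum weighted matching.

    queue_len[i][j] is the backlog from input i to output j (assumed
    square matrix). Returns (total_weight, edges). Illustrative only:
    the running time grows as N!.
    """
    n = len(queue_len)
    best_weight, best_edges = -1, []
    for perm in permutations(range(n)):
        # Pair input i with output perm[i]; skip pairs with empty queues,
        # since they have no cell to send and contribute no weight.
        edges = [(i, j) for i, j in enumerate(perm) if queue_len[i][j] > 0]
        weight = sum(queue_len[i][j] for i, j in edges)
        if weight > best_weight:
            best_weight, best_edges = weight, edges
    return best_weight, best_edges
```

With `queue_len = [[3, 2], [2, 0]]` the search prefers the edges (0, 1) and (1, 0), total weight 4, over the single heaviest edge (0, 0) with weight 3.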

No speedup is required for this result. However, a main
drawback preventing the practical application of this
theoretical result is that maximum weighted matching algorithms
are complex and slow, and are therefore not suitable for
implementation in high-speed switches. Most algorithms
have O(N^3) or comparable complexity, and large overhead.

To overcome this problem, faster algorithms have
recently been proved to achieve the same result of bounding
expected queue lengths, and though not necessarily prior art,
are presented here for a description of the present state of the
art. For example, Mekkittikul and McKeown, "A Practical
Scheduling Algorithm to Achieve 100% Throughput in
Input-Queued Switches," IEEE INFOCOM 98, San
Francisco, April 1998, uses maximum weighted matchings.
However, the weights are "port occupancies" defined by
w(e_{ij}) = sum of queue lengths of all VCs at input port i and
all VCs destined to output port j. By using these edge weights,
a faster algorithm, with complexity on the order of N^{2.5}
(O(N^{2.5})), can be used to find maximum weighted matchings.

L. Tassiulas, "Linear complexity algorithms for maximum
throughput in radio networks and input queued switches,"
IEEE INFOCOM 98, San Francisco, April 1998 goes one
step further and shows that, with the original queue lengths
as edge weights, expected queue lengths are bounded by a
large class of randomized algorithms. Moreover, some of
these algorithms have O(N^2) complexity or "linear
complexity", i.e., linear in the number of edges.

Mekkittikul and McKeown, "A Starvation-free Algorithm
for Achieving 100% Throughput in an Input-Queued
Switch," ICCCN 1996 also uses a maximum weighted
matching algorithm on edge weights which are waiting
times of the oldest cell in each queue. As a result, the
expected waiting times, or cell delays, are bounded. This
implies queue lengths are bounded, and hence is a stronger
result.

US 6,359,861 B1

All of these results are based on Lyapunov stability
analysis, and consequently, all of the theoretically
established bounds are very loose. While the algorithm of
Tassiulas and Ephremides, and McKeown, Anantharam and
Walrand, "Achieving 100% Throughput in an Input-Queued
Switch," Proc. IEEE INFOCOM, San Francisco, March
1996, exhibits relatively small bounds in simulations, the
sample randomized algorithm given in L. Tassiulas, "Linear
complexity algorithms for maximum throughput in radio
networks and input queued switches," IEEE INFOCOM 98,
San Francisco, April 1998, which is the only
"linear-complexity" algorithm above, still exhibits very large
bounds in simulations. To the best of our knowledge, no
linear-complexity algorithm has been shown to have small
bounds in simulations and also provide some kind of
theoretical guarantee.

Several new works have appeared recently dealing with
QoS guarantees with speedup. The earliest of these,
Prabhakar and McKeown, "On the speedup required for
combined input and output queued switching," Computer
Science Lab Technical Report, Stanford University, 1997,
provides an algorithm that, with a speedup of four or more,
allows an input-queued switch to exactly emulate an
output-queued switch with FIFO queues. In other words, given any
cell arrival pattern, the output patterns in the two switches
are identical. Stoica, Zhang, "Exact Emulation of an Output
Queuing Switch by a Combined Input Output Queuing
Switch," IWQoS 1998, and Chuang, Goel, McKeown,
Prabhakar, "Matching Output Queuing with a Combined
Input Output Queued Switch," Technical Report
CSL-TR-98-758, Stanford University, April 1998 strengthen this
result in two ways. First, their algorithms require only a
speedup of two. Second, their algorithms allow emulation of
other output-queuing disciplines besides FIFO. These results
can therefore be used with many of the common output fair
queuing schemes that have known QoS guarantees.

Charny, Krishna, Patel, Simcoe, "Algorithms for Providing
Bandwidth and Delay Guarantees in Input-Buffered
Crossbars with Speedup," IWQoS 1998, and Krishna, Patel,
Charny, Simcoe, "On the Speedup Required for
Work-Conserving Crossbar Switches," IWQoS 1998, presented
several new algorithms that are not emulation-based but
provide QoS guarantees that are comparable to those
achievable in well-known output-queuing schemes. For example,
delay bounds independent of the switch size are obtained
with a speedup of six. Delay bounds dependent on the switch
size are obtained with a speedup of four. Finally, 100%
throughput can be guaranteed with a speedup of two.

SUMMARY OF THE INVENTION

While theoretical studies have concentrated on the goals
of bounding expected queue lengths and waiting times,
various simulation studies have been carried out to
investigate other aspects as well, such as average delay, packet loss
or blocking probabilities, etc. Some of these studies also
investigated the advantage of having a small speedup of
about two to five (much smaller than N). The scheduling
algorithms used may be based on matching algorithms such
as those of the theoretical works cited above, e.g., maximum
weighted matching, maximum size (unweighted) matching,
randomized matchings, etc.

The present invention focuses on three QoS features:
bandwidth reservations, cell delay guarantees, and fair
sharing of unreserved switch capacity in an input-queued switch
with no speedup. Several embodiments employing fast,
practical, linear-complexity scheduling algorithms are
presented which, in simulations, support large amounts of
bandwidth reservation (up to 90% of switch capacity) with
low delay, facilitate approximate max-min fair sharing of
unreserved capacity, and achieve 100% throughput.

In accordance with the present invention, a method for
scheduling transmission of cells through a data switch,
preferably a crossbar switch, having a plurality of inputs and
outputs, provides a plurality of buffers at each input, each
buffer corresponding to an output. The buffers temporarily
hold incoming cells. A weight is assigned to each buffer; and
buffers are selected according to a weighted matching of
inputs and outputs. Finally, cells are transmitted from the
selected buffers to the corresponding outputs.

Preferably, the matching requires that each buffer which is
not selected must share an input or output with a selected
buffer whose weight is greater than or equal to the unselected
buffer's weight.

Preferably, the matching is a maximal weighted matching
and is determined by using a stable marriage algorithm.
Buffers having the greatest weight are selected first,
followed by buffers having the next greatest weight, and so on,
until buffers having a least positive weight are assigned.

In a preferred embodiment, assigning weights, selecting
buffers and transmitting cells are performed repeatedly over
consecutive timeslots. Within each timeslot, credits are
assigned to each buffer according to a guaranteed bandwidth
for that buffer. The weights associated with each buffer are
set based on an accumulated number of credits associated
with the buffer. Preferably, credits are assigned in integral
units, including zero units.

In another preferred embodiment, the weight associated
with a buffer is zero if the buffer is empty, regardless of
actual credit.

In yet another preferred embodiment, a credit bucket size
is assigned to each buffer. If a buffer is empty and has a
number of credits equal to its associated credit bucket size,
the buffer receives no further credits.

In still another preferred embodiment, each weight
associated with a buffer is set to either the buffer's length, or to
the number of credits associated with the buffer, preferably
whichever is less. In an enhancement to this embodiment,
the age of each cell is maintained, and if the age for some
cell exceeds a predefined threshold for the corresponding
buffer, an exception mechanism is employed to decide
whether to select the buffer. In another enhancement, cells
are flushed out of the buffer with phantom cells during long
idle periods.

In yet another preferred embodiment, each buffer's
weight is set to a validated waiting time associated with an
oldest cell in the buffer. Validated waiting time for a cell is
determined by validating a cell when there is a credit
available, and recording the time of validation for each cell.
The validated waiting time for that cell is then calculated
based on the difference between the current time and the
validation time.

Alternatively, the validated waiting time of the oldest cell
in a buffer is determined to be either the actual waiting time
of the oldest cell, or the age of the oldest credit associated
with the buffer, whichever is less.

In yet another alternative, the validated waiting time of
the oldest cell is estimated. The estimate is based on the
actual waiting time of the oldest cell, the number of credits
associated with the buffer, and the rate at which credits are
accrued.

In still another preferred embodiment, each buffer's
weight is scaled by a constant which is inversely
proportional to a predetermined tolerable delay. Preferably, the
tolerable delay associated with a buffer is the inverse of the
guaranteed bandwidth associated with the buffer.
In yet another preferred embodiment, a weighted matching
is computed at each timeslot and a corresponding total
edge weight for the matching determined. The total edge 
weight of the determined current matching is compared with 
the selected matching from the previous timeslot. The 
matching having the largest total edge weight is selected. 
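
The comparison with the previous timeslot's matching can be sketched in Python as follows. This is a minimal sketch, assuming that both matchings are scored with the current timeslot's weights and that ties keep the new matching; the names `total_weight` and `keep_heavier` are hypothetical.

```python
def total_weight(matching, weight):
    """Sum of weight[i][j] over the (i, j) edges of a matching."""
    return sum(weight[i][j] for i, j in matching)


def keep_heavier(current, previous, weight):
    """Return whichever matching has the larger total edge weight.

    Assumptions of this sketch: both matchings are evaluated against
    the same (current) weight matrix, and a tie keeps `current`.
    """
    if total_weight(previous, weight) > total_weight(current, weight):
        return previous
    return current
```

With weights `[[5, 1], [1, 5]]`, a newly computed matching `[(0, 1), (1, 0)]` (total 2) would be discarded in favor of a previous matching `[(0, 0), (1, 1)]` (total 10).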

In still another preferred embodiment, fairness is provided 
in any leftover bandwidth by determining a second matching 
between remaining inputs and outputs. Buffers are selected 
according to the second matching, and cells are transmitted 
from the selected buffers to the corresponding outputs. 
Preferably, max-min fairness is used to determine the second 
matching. Alternatively, during a second phase of weight 
assignments, additional paths are chosen based on usage 
weights. In yet another alternative, fairness is implemented 
by assigning weights based on both usage and credits. 

In yet another preferred embodiment, several virtual con- 
nections share the same input-output pair. Each virtual 
connection has its own guaranteed rate. At each input, a 
buffer is provided for each virtual connection passing 
through that input. For each input/output pair, the virtual 
connection with the maximum weight is determined, and 
that weight is assigned to the corresponding input/output 
pair. Input/output pairs are then selected based on the 
assigned weights, and according to a maximal weighted 
matching. Finally, cells are transmitted from the selected 
inputs to the corresponding outputs. 
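
The reduction from per-VC weights to per-pair weights described above might look like the following Python sketch. The dictionary layout and the name `pair_weights` are assumptions for illustration; the resulting pair weights would then be passed to whatever matching step is in use.

```python
def pair_weights(vc_weights):
    """Collapse per-VC weights into one weight per input/output pair.

    vc_weights maps (input, output) -> {vc_id: weight} (assumed layout).
    Returns (input, output) -> (best_vc_id, weight), where the weight is
    the maximum over the VCs sharing that pair.
    """
    result = {}
    for pair, vcs in vc_weights.items():
        if vcs:  # pairs without any VC get no edge
            best_vc = max(vcs, key=vcs.get)
            result[pair] = (best_vc, vcs[best_vc])
    return result
```

Remembering which VC supplied the maximum lets the input port know which VC's buffer to serve if its pair is selected.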

In still another preferred embodiment, a data structure of 
linked lists is provided. Each list is associated with a weight, 
and holds references to buffers which have that weight. In 
addition, each list has links to next and previous lists 
associated respectively with weights one greater and one 
less than the subject list's associated weight. Each buffer 
reference is placed in a list associated with the weight of the 
buffer. Upon incrementing a buffer's weight by one, its 
reference is moved from its current list to the next list. 
Similarly, upon decrementing a buffer's weight by one, its 
reference is moved from its current list to the previous list. 
Finally, for each list in order of descending weights, buffers 
are selected which do not share input or output nodes with 
buffers which have already been selected. 
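
A simplified Python sketch of this structure follows. For brevity a dictionary of sets keyed by weight stands in for the linked lists, so moving a reference to the "next" or "previous" list becomes a move to the set for weight plus or minus one. The class name `WeightBags`, the tie-breaking order within a list, and the choice to skip non-positive weights are assumptions of the sketch.

```python
from collections import defaultdict


class WeightBags:
    """Buffers grouped by integer weight (dict of sets, not real linked lists)."""

    def __init__(self):
        self.bags = defaultdict(set)   # weight -> {(input, output), ...}
        self.weight = {}               # (input, output) -> current weight

    def add(self, buf, w=0):
        self.weight[buf] = w
        self.bags[w].add(buf)

    def change(self, buf, delta):
        """Move buf to the neighbouring bag; delta is +1 or -1."""
        old = self.weight[buf]
        self.bags[old].discard(buf)
        self.weight[buf] = old + delta
        self.bags[old + delta].add(buf)

    def select(self):
        """Visit bags in descending weight and pick buffers whose input
        and output are not yet used by an already selected buffer."""
        used_in, used_out, chosen = set(), set(), []
        for w in sorted((w for w in self.bags if w > 0), reverse=True):
            for i, j in sorted(self.bags[w]):  # fixed order: an assumption
                if i not in used_in and j not in used_out:
                    chosen.append((i, j))
                    used_in.add(i)
                    used_out.add(j)
        return chosen
```

Because a weight changes by at most one per update, each move touches only two adjacent bags, which is what makes the linked-list version cheap to maintain.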

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing and other objects, features and advantages 
of the invention will be apparent from the following more 
particular description of preferred embodiments of the 
invention, as illustrated in the accompanying drawings in 
which like reference characters refer to the same parts 
throughout the different views. The drawings are not nec- 
essarily to scale, emphasis instead being placed upon illus- 
trating the principles of the invention. 

FIG. 1 is a schematic drawing illustrating a 3x3 crossbar 
switch with per-output input queuing. 

FIG. 2 is an illustration of a bipartite graph representing 
the 3x3 switch of FIG. 1. 

FIG. 3 is a flow chart illustrating the basic steps of the 
present invention. 

FIGS. 4A-4G are block diagrams illustrating various 
terms. 

FIGS. 5A-5C are drawings illustrating a weighted graph 
and related maximum weighted and stable marriage match- 
ings respectively. 

FIG. 6 is a flowchart of the central queue algorithm of the 
present invention. 

FIG. 7 is a block diagram illustrating the use of doubly- 
linked lists, or bags, in the central algorithm of FIG. 6. 
FIG. 8 is a schematic diagram illustrating a credit-
weighted embodiment of the present invention.

FIG. 9 is a schematic diagram illustrating a credit-
weighted embodiment of the present invention in which the
weight is zero if queue length is zero.

FIG. 10 is a schematic diagram illustrating the concept of
credit bucket size used with either of the embodiments of
FIGS. 8 or 9.

FIG. 11 is a schematic diagram illustrating an 
LC-weighted embodiment of the present invention in which 
weights assigned to the paths are based on the number of 
validated cells. 

FIG. 12 is a schematic diagram illustrating a
validated-waiting-time weighting embodiment of the present
invention.

FIG. 13 is a schematic diagram illustrating a two-phase 
usage-weighted embodiment of the present invention. 

FIGS. 14A-14C are schematic diagrams illustrating a
single-phase usage-credit-weighted embodiment of the
present invention.

FIGS. 15A-15B are schematic diagrams illustrating mul- 
tiple VCs per input-output pair. 

DETAILED DESCRIPTION

FIG. 1 schematically illustrates a typical 3x3 crossbar
switch 10 with per-output input queuing. Each packet or cell
arriving at any of the inputs IN_1-IN_3 is routed to the proper
queue Q_{ij} according to its destination. Here, queues such as
Q_{11} and Q_{12} are shown holding cells 11 while other queues
such as Q_{13} and Q_{22} are empty. Paths P_{ij} with solid lines are
shown connecting queues with cells 11 to the switch fabric
13, indicating possible transmission paths. Paths P_{13} and P_{22}
are shown in dashed lines because the corresponding queues
Q_{13} and Q_{22} are empty, indicating that there are presently no
cells to transmit. During each timeslot, the switch fabric 13
transfers packets or cells from selected queues to the
destination outputs OUT_1-OUT_3.

It is assumed that packets are transferred across the switch
fabric 13 in fixed-size cells. Even if incoming packets have
different sizes, however, they can be broken down into
smaller, fixed-size cells for easier handling and re-assembled
later. Therefore, packets and cells are used interchangeably
hereinafter.

The switch fabric 13 may be a crossbar switch or any
functional equivalent, and has the constraint that during any
given timeslot, any input port can transmit to one output port
(or none at all), and any output port can only receive from
one input port (or none at all).

In addition, it is assumed that the switch has the minimum
speedup of one, i.e., the fabric speed is equal to the input
speed. The motivation is that lower speedup makes the
switch more feasible and scalable in terms of current
technology and costs. A speedup of one also provides the most
stringent testing condition for the present invention's
algorithms in simulations.

Under the above assumptions, a scheduling algorithm
must choose a set of cells to transfer at each timeslot, with
the main goals of supporting bandwidth reservations and cell
delay guarantees. The choice is based on various parameters
associated with each VC's queue.

FIG. 2 illustrates a bipartite graph G=(U,V,E) which
abstractly represents the crossbar switch of FIG. 1. The input
ports IN_1-IN_3 of FIG. 1 are represented in the graph of FIG.
2 as a set of nodes U={I_1, I_2, I_3} respectively, and the output
ports OUT_1-OUT_3 are represented as another set of nodes
V={O_1, O_2, O_3}. The edges E represent possible
transmissions, i.e., the paths P_{ij} of FIG. 1 over which cells could be
transmitted from a particular input port IN_i to a particular
output port OUT_j. The set E of edges is determined once for
each timeslot. If a particular queue is empty, there is no edge
in the graph for that timeslot.
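
As a small illustration of how the edge set E could be rebuilt each timeslot, the Python sketch below derives it from a matrix of queue lengths. The function name `build_edges`, the zero-based port indices and the sample queue lengths are assumptions; they are not taken from FIG. 1.

```python
def build_edges(queue_len):
    """Return the set of (input, output) pairs whose queue is non-empty.

    queue_len[i][j] is the number of cells queued at input i for
    output j (zero-based indices, an assumption of this sketch).
    """
    return {
        (i, j)
        for i, row in enumerate(queue_len)
        for j, cells in enumerate(row)
        if cells > 0
    }
```

For a hypothetical 3x3 switch with `queue_len = [[2, 1, 0], [1, 0, 3], [0, 0, 1]]`, the edge set contains five edges, and the pairs with empty queues, such as (0, 2) and (1, 1), are absent.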

For example, input queue Q_{11} (FIG. 1) holds some cells
11 destined for OUT_1, therefore an edge e_{11} is drawn in the
graph of FIG. 2, representing a possible transmission along
path P_{11}. Thus edge e_{11} corresponds to path P_{11} of FIG. 1,
and so on. Transmissions through the switch fabric 13 can
occur at the same time if and only if the selection of paths,
or edges, corresponds to a "matching", i.e., a subset of edges
M \subseteq E such that each node has at most one connecting edge
in M.

Most scheduling algorithms, including those of the
present invention, associate a priority or weight w_{ij}=w(e_{ij})
with each edge e_{ij} \in E. Thus most scheduling algorithms are
characterized by two separate choices: deciding how to
assign edge weights w(e_{ij}), and computing a matching given
the weighted graph (G,w). The present invention's
contributions derive from judicious choices of edge weights.

FIG. 3 is a flowchart illustrating the basic steps of a
preferred embodiment of the present invention. First, in step
201, the weights w_{ij} are calculated for each edge e_{ij}. All of
the weighting algorithms of the present invention are based
at least in part on "credits", discussed further below.

After assigning weights, the edges are sorted by weight
(step 203). Then, a "central queue" algorithm is used to find
a stable marriage matching (step 205). The resulting
matching may be compared, in step 207, with the matching from
the previous timeslot, and the matching with the highest total
weight selected.
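
One simple way to picture steps 203 and 205 is a greedy pass over the sorted edges, sketched below in Python. This is only an illustration under stated assumptions: the central queue algorithm itself is described with FIG. 6 and may differ in detail, and the name `greedy_matching`, the dictionary input, the tie-breaking rule and the exclusion of zero-weight edges are all choices made for this example.

```python
def greedy_matching(weights):
    """Greedy maximal weighted matching.

    weights maps (input, output) -> weight. Edges are visited from
    heaviest to lightest and kept when both endpoints are still free,
    so every rejected edge shares a node with a kept edge of equal or
    greater weight.
    """
    used_in, used_out, matching = set(), set(), []
    # Sort by descending weight; ties are broken by (input, output),
    # which is an arbitrary choice of this sketch.
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for (i, j), w in ordered:
        if w > 0 and i not in used_in and j not in used_out:
            matching.append((i, j))
            used_in.add(i)
            used_out.add(j)
    return matching
```

For weights `{(0, 0): 5, (0, 1): 4, (1, 0): 4, (1, 1): 1}` the greedy pass returns `[(0, 0), (1, 1)]` with total weight 6, whereas the maximum weighted matching would be (0, 1) and (1, 0) with total weight 8. The greedy result is therefore maximal but not necessarily maximum, which is the trade made for speed.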

If transmission paths (or edges) are still available across
the switch after a matching has been selected, a fairness
algorithm may be applied, in step 209, to select from among
the remaining edges. The selections are added to the
matching, and finally, in step 211, cells are transmitted
across the switch fabric according to the selected matching.

Note that in at least one preferred embodiment, step 209,
the fairness algorithm, is merged with step 201, the
calculation of edge weights, such that steps 203, 205, 207 provide
both bandwidth guarantees and fairness for leftover
bandwidth.

The present invention envisions a scheme where, at
start-up time, each VC (or flow, or session) negotiates during
an admission control process for a guaranteed transmission
rate. The network grants, denies or modifies the requests
based on external factors such as priority, billing, etc., in
addition to current congestion level. How admission control
makes this decision is not considered. It is simply assumed
that the network will not overbook any resource. Once
agreed, a VC's guaranteed rate typically does not change,
although this is not required by the present invention.

Two extreme cases clarify what it means to have
bandwidth "reserved" for a VC. First, if the VC sends a smooth
stream of cells below its guaranteed rate, then the cells
should be transmitted with very little delay. Alternatively, if
the VC is extremely busy and constantly has a large backlog
of cells queued up, then its average transmission rate should
be at least its guaranteed rate.

It is less clear what should happen when the VC is very
bursty and sometimes transmits at a very high peak rate and
sometimes becomes idle, even though its average rate is
comparable to its guaranteed rate. Typical traffic is indeed
bursty and some burstiness must be tolerated, but it is very
difficult to design an algorithm to tolerate arbitrary amounts
of burstiness. Thus, a compromise must be sought. We
propose to clarify this issue by providing a "contract" with
each algorithm. Each VC (or user) can understand exactly
what service to expect from each algorithm.

Various parameters associated with a typical VC v are
defined with the help of FIGS. 4A-4G. Cells of a VC v are
received at input IN_v and stored in the corresponding buffer
Q_v of the switch 30. L_v(t) denotes the queue length, i.e., the
number of input-queued cells 32 of v at the beginning of
timeslot t. In FIG. 4A, there are five cells in buffer Q_v, so
L_v(t)=5.

As shown in FIG. 4B, A_v(t) denotes the number of cells
belonging to v that arrive during timeslot t. Here, a cell 35
is arriving during timeslot t, so that A_v(t)=1. S_v(t) denotes the
number of cells belonging to v that are transmitted across the
switch fabric during timeslot t. Note that S_v(t) can only be
0 or 1. Here, a queued cell 37 is transmitted across the
switch, so that S_v(t)=1.

Thus, as can be seen from FIG. 4C, at the beginning of the
next timeslot t+1,

L_v(t+1) = L_v(t) + A_v(t) - S_v(t).    (1)

W_v(t) denotes the waiting time or delay of the oldest
input-queued cell of v, measured in units of timeslots. In a
preferred embodiment, a cell arriving in the current timeslot
has a minimum waiting time of one, a cell that arrived in the
previous timeslot has a waiting time of two, etc. This is
demonstrated in FIG. 4D. Waiting times 38 are maintained
for each cell in the queue. The waiting time W_v(t) for the
queue is then the waiting time of the oldest cell 36, so in this
case W_v(t)=10 timeslots. Of course, time stamps may be
maintained rather than waiting times, which may then be derived
from the time stamps and the current time.

VCs with an empty queue have waiting times defined to
be zero. For example, in FIG. 4E, buffer Q_v is empty. Thus,
its delay or waiting time is zero, i.e., W_v(t)=0.
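
Deriving the waiting time from time stamps, as mentioned above, could be sketched in Python as follows. The function name `waiting_time` and the use of a deque of arrival timeslots (oldest first) are assumptions of this example.

```python
from collections import deque


def waiting_time(arrival_slots, now):
    """Waiting time of the oldest queued cell, in timeslots.

    arrival_slots holds the arrival timeslot of each queued cell, oldest
    first. A cell that arrived in the current timeslot has waiting time
    one; an empty queue has waiting time zero.
    """
    if not arrival_slots:
        return 0
    return now - arrival_slots[0] + 1
```

For instance, if the oldest cell arrived in timeslot 3 and the current timeslot is 12, the waiting time is 10.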

Each VC v has a guaranteed bandwidth (GBW) denoted
as g_v, measured in units of cells/timeslot. Outstanding credit,
or simply credit, C_v(t), of a VC v at the beginning of a
timeslot t is then defined by the following equation:

C_v(t+1) = C_v(t) + g_v - S_v(t).    (2)

A VC thus gains fractional credits at a steady rate equal to
its guaranteed rate g_v, and spends one credit whenever it
transmits a cell. An equivalent view is that

C_v(t) = g_v t - \sum_{s<t} S_v(s),

i.e., the guaranteed number of transmissions up to time t for
VC v, less the actual number of transmissions up to time t.
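
The credit rule described above (credit gained at the guaranteed rate in every timeslot, one credit spent per transmitted cell) can be sketched in Python with exact fractions. The function names and the assumption that the VC starts with zero credit are illustrative choices.

```python
from fractions import Fraction


def next_credit(credit, rate, sent):
    """One timeslot of credit bookkeeping: add the guaranteed rate,
    subtract one credit if a cell was transmitted (sent is 0 or 1)."""
    return credit + rate - sent


def credit_after(rate, sends, start=Fraction(0)):
    """Apply next_credit over a sequence of per-timeslot send flags.
    Starting from zero credit is an assumption of this sketch."""
    credit = start
    for sent in sends:
        credit = next_credit(credit, rate, sent)
    return credit
```

For a guaranteed rate of 2/5 cells per timeslot, five timeslots earn two credits; if one cell was sent in that time, one credit remains, which agrees with the "guaranteed minus actual transmissions" view.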

This concept of credits is demonstrated in FIG. 4F. Credits
are maintained, conceptually at least, in a credit buffer 40. In
practice, of course, the number of credits may be simply
maintained in a register. For a given VC v, the number of
credits C_v grows at the continuous rate of g_v credits per
timeslot, as indicated by arrow 39. In this example, at time
t there are two full credits 42 and a fractional credit 41.

Though not necessary, a preferred embodiment uses the
integral part \lfloor C_v(t) \rfloor as an approximation to the real
number C_v(t). The difference is negligible and from here on
C_v(t) is used even when \lfloor C_v(t) \rfloor is meant. Now, if
\tau_v = 1/g_v denotes the number of timeslots it takes for a VC
to accrue one unit of credit, then since C_v(t) is approximated
as \lfloor C_v(t) \rfloor, credits increase one by one, i.e., there are
no fractional increases, and \tau_v is the period between credit
increments.

Thus, in FIG. 4G, there are two full credits 42 and no
fractional credit. Additional credits 43 are added to the credit
buffer 40 at the rate of one credit every \tau_v timeslots. Thus if
a credit 45 is accrued at timeslot t, then another credit 46 is
accrued at timeslot t+\tau_v and another at t+2\tau_v, and so on. Of
course, \tau_v itself is not confined to integer values. For
example, if g_v=2/5, then \tau_v=5/2, meaning that every five
timeslots two credits should be created. One implementation
is to accrue both at once, every five timeslots. Another
implementation might accrue a first credit after two
timeslots and a second credit after three timeslots.
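
One possible integer accrual schedule is sketched below in Python: a whole credit is granted whenever the integral part of the accumulated rate increases. This is just one of the implementations the text allows; the function name `accrual_slots` and the convention that timeslots are numbered from 1 are assumptions.

```python
from fractions import Fraction
from math import floor


def accrual_slots(rate, horizon):
    """Timeslots in 1..horizon at which a whole credit is granted,
    i.e. where floor(rate * t) exceeds floor(rate * (t - 1))."""
    return [
        t for t in range(1, horizon + 1)
        if floor(rate * t) > floor(rate * (t - 1))
    ]
```

For a rate of 2/5, this schedule grants credits at timeslots 3, 5, 8 and 10, so the gaps alternate between three and two timeslots and average to the period of 5/2. The variant mentioned in the text (two timeslots, then three) differs only in phase.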

In practice it is likely that several VCs share the same
input-output pair. In this case each VC has its own
guaranteed rate. However, for the sake of simplicity we temporarily
restrict each VC to a distinct input-output pair. This
restriction is removed below. Thus, we can write g_{ij}, L_{ij}(t), etc.,
when we mean g_v, L_v(t) where v is the unique VC that goes
from input i to output j.

As Equation (2) and FIG. 4G show, credits are depleted by 
decrementing the value of C_v(t), or equivalently, by removing 
a credit from the credit buffer, whenever a cell 47 
associated with a credit 49 is transmitted. The association is 
indicated in FIG. 4G by arrow 48. 

An input or output port can only send or receive, 
respectively, one cell per timeslot. To avoid overbooking 
resources, the network management must ensure that: 

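$$\sum_{j} g_{ij} \le 1 \quad \text{for all } i \qquad (3)$$

$$\sum_{i} g_{ij} \le 1 \quad \text{for all } j \qquad (4)$$

We then define the loading factor of the switch as 

$$\alpha = \max\Bigl(\max_{i} \sum_{j} g_{ij},\; \max_{j} \sum_{i} g_{ij}\Bigr) \qquad (5)$$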

that is, the highest fractional load of all input and output 
ports. 
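
As a small worked example of the overbooking conditions and the loading factor, the sketch below computes the largest row sum and column sum of a matrix of guaranteed rates. The helper names `loading_factor` and `is_admissible` and the sample matrix are assumptions made for illustration only.

```python
from fractions import Fraction

def loading_factor(g):
    """Highest fractional load over all input ports (rows) and output ports (columns)."""
    n = len(g)
    input_loads = [sum(g[i][j] for j in range(n)) for i in range(n)]
    output_loads = [sum(g[i][j] for i in range(n)) for j in range(n)]
    return max(max(input_loads), max(output_loads))

def is_admissible(g):
    """True if no input or output port is booked beyond one cell per timeslot."""
    return loading_factor(g) <= 1

# Hypothetical 2x2 reservation matrix; g[i][j] is the rate from input i to output j.
g = [[Fraction(1, 2), Fraction(1, 4)],
     [Fraction(1, 4), Fraction(1, 2)]]
print(loading_factor(g), is_admissible(g))
```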



Given these definitions, if queue lengths L_ij(t) are used as 
edge weights at time t, then E[L_ij(t)], the expected value of 
L_ij(t), is bounded (using a maximum weighted matching 
algorithm, and assuming traffic streams are i.i.d. and the 
loading factor is strictly less than 100%, i.e., α<1). See 
Tassiulas and Ephremides, "Stability properties of constrained 
queuing systems and scheduling policies for maximum 
throughput in multihop radio networks," IEEE 
Trans. Automatic Control, vol. 37, no. 12, December 1992, 
pp. 1936-1948, and [3]. 

Similarly, Mekkittikul and McKeown, "A Starvation-free 
Algorithm for Achieving 100% Throughput in an Input-Queued 
Switch," ICCCN 1996, states that if waiting times 
W_ij(t) are used as edge weights, then E[W_ij(t)] is bounded. 
Tassiulas, "Linear complexity algorithms for maximum 
throughput in radio networks and input queued switches," 
IEEE INFOCOM 98, San Francisco, April 1998, states that 
for the w=L_ij(t) case the same is true if a certain class of 
randomized algorithms is used instead. 

The algorithms of the present invention are designed 
according to the same general principle. Some edge weights 
are chosen, and we hope that a matching algorithm will 
make them bounded. Preferred embodiments use edge 
weights which are functions of L_v, W_v, C_v, i.e., w(e)=f(L_v(t), 
W_v(t), C_v(t)). The function f() is chosen carefully so that a 
bound for E[w(e)] corresponds to a precise bandwidth reservation 
contract. Moreover, the contract can be understood in 
practical and intuitive terms. 

Instead of slow maximum weighted matching algorithms, 
the present invention uses fast stable marriage matching 
algorithms and variations. Because such algorithms run 
faster but are less powerful for the task, we are only able to 
give a theoretical proof of boundedness in certain cases 
when the bandwidth reservations make up at most 50% of 
the switch capacity. However, in simulations, edge weights 
w=f(L,W,C) are observed to be bounded by small constants 
at much higher loading, even when 90% of the switch 
capacity is reserved. 

Since all of the scheduling algorithms of the present 
invention use different edge weights but may use the same 
matching algorithm, the matching algorithm is explained 
first. After that, the different edge weights are introduced in 
successive sections in order of increasing conceptual and 
implementational complexity. 
Stable marriage matching 

A study of the combinatorial problem of stable marriage 
matchings first appeared in D. Gale and L. S. Shapley, "College 
Admissions and the stability of marriage," American Mathematical 
Monthly, vol. 69, 1962, pp. 9-15. In the original 
context, N men and N women each have a preference list 
ranking all persons of the opposite sex in order of preference 
for marriage. A stable marriage is a complete pairing of all 
men and all women, such that one cannot find a man and a 
woman, not married to each other, who would prefer each 
other to their current mate, the idea being that if such a pair 
exists, they would "run away" and the marriages would not 
be "stable". 

In the context of input-queued switch scheduling, stable 
marriage matchings have been considered before. See, for 
example, Nick McKeown, Scheduling Algorithms for Input-Queued 
Cell Switches, PhD Thesis, University of California 
at Berkeley, May 1995. In this context, each input i ranks all 
outputs according to the weights w(e_ij) for all j, and similarly 
each output ranks all inputs. These rankings comprise the 
preference lists. Note that while it is possible to transform 
N² edge weights into preference lists in this way, the reverse 
is not always possible, i.e., some sets of preference lists may 
not correspond to any set of edge weights. Ties in edge 
weights can be broken by lexicographical order or left 
unbroken (as a slight generalization of the original problem 
setting). 

Now, the following definition of stable marriage matching 
can be used. Given a weighted bipartite graph (U,V,E,w), 
where the weights w associated with each node define that 
node's preference list, a matching M ⊆ E is a stable marriage 
matching if, for any edge e ∉ M, there is an edge e_M ∈ M 
such that e_M and e share a common node and w(e_M) ≥ w(e). 
Note that this definition is similar, but not equivalent, to an 
unweighted maximal matching, i.e., a matching for which it 
is not possible to add another edge. In an unweighted 
maximal matching, for every unselected edge e ∉ M, there 
exists a selected edge e_M ∈ M such that e_M and e share a 
common node. The definition of a stable marriage matching 
adds the requirement that w(e_M) ≥ w(e). Thus we also refer 
to a stable marriage matching as a maximal weighted 
matching. 
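
The definition above translates directly into a check over all unselected edges. The following Python sketch is an illustration only: the function names and the representation of edges as (input, output) pairs are assumptions, and the sample weights are those of the 2x2 graph of FIG. 5A.

```python
def is_matching(edges):
    """True if no input and no output appears in more than one edge."""
    inputs = [i for i, _ in edges]
    outputs = [j for _, j in edges]
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)

def is_stable_marriage_matching(weights, matching):
    """weights: dict {(i, j): w}; matching: collection of (i, j) edges."""
    if not is_matching(matching):
        return False
    for (i, j), w in weights.items():
        if (i, j) in matching:
            continue
        # The unselected edge must share a node with a selected edge
        # whose weight is at least as large.
        dominated = any(
            (mi == i or mj == j) and weights[(mi, mj)] >= w
            for (mi, mj) in matching
        )
        if not dominated:
            return False
    return True

# Weights of FIG. 5A: w_11=100, w_12=90, w_21=90, w_22=1.
fig_5a = {(1, 1): 100, (1, 2): 90, (2, 1): 90, (2, 2): 1}
print(is_stable_marriage_matching(fig_5a, {(1, 1), (2, 2)}))
print(is_stable_marriage_matching(fig_5a, {(1, 2), (2, 1)}))
```

The first matching is stable although its total weight is only 101; the second has the larger total weight of 180 but is not stable, because edge (1, 1) outweighs both selected edges that it touches.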

FIGS. 5A-5C illustrate a weighted graph and related 
maximum weighted and stable marriage matchings, respectively. 
FIG. 5A depicts a weighted graph 51 which 
represents, for simplicity, a 2x2 switch, with two input nodes 
I_1 and I_2, and two output nodes O_1 and O_2. Weights for the 
paths P_11, P_12, P_21 and P_22 are indicated by the values 
w_11=100, w_12=90, w_21=90 and w_22=1, as shown. 

FIGS. 5B and 5C illustrate the two possible maximal 
matchings derivable from the graph of FIG. 5A, that is, no 
edges from graph 51 can be added while maintaining a 
matching. Of the two matchings, FIG. 5B illustrates a 
maximum weighted matching 53 of the graph of FIG. 5A, 
comprising edges P_12 and P_21, because its total weight, 
w_12+w_21=90+90=180, is greater than the total weight of any 
other matching. The total weight of FIG. 5C's matching 55 
is w_11+w_22=100+1=101. 

Note, however, that although matching 53 of FIG. 5B is 
a maximum weighted matching, it is not a stable marriage 
matching. Specifically, edge P_11, shown as a dotted line, is 
an edge not in the matching 53 which shares a common node 
I_1 with an edge P_12 in the matching 53, wherein 
w(P_11)=100>w(P_12)=90. 

On the other hand, although the matching 55 of FIG. 5C is not 
a maximum weighted matching, it is a stable marriage 
matching. 

As defined, stable marriage matchings seem not to have 
much in common with maximum weighted matchings used 
in no-speedup scenarios. However, it is proven in Theorem 
1 in the appendix that given a weighted bipartite graph 
(U,V,E,w) with non-negative weights, a stable marriage 
matching X, and a maximum weighted matching Y, the total 
weight of X is at least ½ of the total weight of Y. This is a 
new theoretical result, which, combined with the Lyapunov 
analysis techniques of Tassiulas and Ephremides, and 
McKeown, Anantharam and Walrand, allows us to prove 
that some of our algorithms can support bandwidth reservations 
of up to 50% of switch capacity, with constant delay 
bounds. 

There are several known algorithms for computing stable 
marriage matchings. In Gale and Shapley's original 
algorithm, each man (input) proposes to his most preferred 
woman (output). Each woman accepts her most preferred 
proposal so far, and the two become "engaged". Each 
unmatched man goes on to propose to his next most preferred 
woman, etc. A woman always accepts her most 
preferred proposal so far, breaking a previous engagement if 
necessary, in which case her previous man becomes 
unmatched again. Gale and Shapley show that the algorithm 
terminates with a stable marriage. 

A preferred embodiment of the present invention employs 
a new, slightly faster algorithm which works on all edge 
weights together, instead of treating them as preference lists. 
FIG. 6 illustrates this "central queue" algorithm which 
assumes that edges have been sorted by weight (step 203 of 
FIG. 3). The algorithm, corresponding to step 205 of FIG. 3, 
starts, in step 61, with an empty matching M=∅. Edges are 
selected for examination in decreasing order of weight in 
step 63. In step 65, a selected edge e is examined to see if 
it can be added to M, i.e., if M ∪ {e} is still a matching. If 
M ∪ {e} is a matching, then edge e is added to the matching M 
in step 67. Otherwise edge e is discarded and the next edge 
is examined. 

The algorithm stops when M has reached its maximum 
possible size of N edges (step 69), or when all the edges have 
been examined, indicated by 71. 

The central queue algorithm is thus a greedy algorithm 
with no backtracking, unlike Gale and Shapley's algorithm 
which allows engagements to be broken off. Theorem 2 in 
the appendix proves that this central queue algorithm computes 
a stable marriage matching. 

The complexity of both algorithms is the same and equal 
to O(N²), i.e., linear in the number of edges, once the edge 
weights are sorted, i.e., once the preference lists are prepared. 
In general, sorting would increase the complexity to 
O(N² log N). However, there are two significant opportunities 
for lowering the sorting complexity. 

First, some of the algorithms of the present invention have 
the property that, from one timeslot to the next, edge weights 
change by at most a small constant amount. With this 
property, the edges can be maintained in sorted order by 
using a linear, one-pass process to update the sorting from 
one timeslot to the next. 

More precisely, a doubly-linked list of bins, as shown in 
FIG. 7, is maintained, where each bin holds all edges of the 
same weight. Changing an edge weight simply requires 
taking the edge from its current bin, or eliminating the bin 
if this is the last edge, and putting it in a bin corresponding 
to the new weight, or creating this bin and inserting it into 
the doubly-linked list if necessary. Increasing or decreasing 
an edge weight by any constant small amount therefore takes 
only small constant time, and sorting is maintained in linear 
O(N²) time. 

FIG. 7 demonstrates the use of these lists, or bins. Several 
bins 73-76 are shown having associated weights 72 of one, 
two, three and so on up to some maximum weight m, 
respectively. Each bin 73-76 is shown containing several 
edges 77. If an edge P_11 has a weight of two, it would be in 
the bin 74 associated with weight two, as shown. If the 
weight of edge P_11 is decremented by one, P_11 is simply 
moved from the current bin 74 to the previous bin 73 as 
indicated by arrow 78. Similarly, if the weight of edge P_11 
is incremented by one, P_11 is moved from the current bin 74 
to the next bin 75 as indicated by arrow 79. Now, when the 
central queue algorithm of FIG. 6 examines edges in 
descending order, it simply starts with edges in the bin 
associated with the highest weight, namely bin 76, and 
works down toward the bin 73 associated with the lowest 
weight. 

Second, in simulations, edge weights were bounded by 
small integer constants. While a theoretical proof of boundedness 
cannot be given, this suggests using as many bins as 
the bound (or twice the bound, to be safe). Edges having 
weights which exceed the number of bins must still be sorted 
by a general sorting and so worst-case complexity is still 
O(N² log N), but actual complexity will usually be linear 
O(N²). 
Optimization: The update rule 

As an optimization for the above preferred embodiment, 
at each timeslot, a stable marriage matching M is computed, 
and compared to the matching M' used in the previous 
timeslot. Applying current weights to both matchings, the 
matching with the larger total edge weight is selected for the 
current timeslot. Thus it is possible that when a particularly 
high-weight matching is found in one timeslot, it will be 
used in several subsequent timeslots if the edge weights 
change only slowly over time. In simulations this optimization 
was found to slightly improve performance. This 
optimization corresponds to step 207 of FIG. 3. 
Credit-weighted edges 

When all VCs are constantly backlogged, the bandwidth 
reservation problem is relatively easy to solve. In this 
scenario queues are never empty, and in fact conceptually 
this case can be treated as having large queue lengths and 
waiting times, i.e., L_ij(t)→∞, W_ij(t)→∞. FIG. 8 shows a 
preferred embodiment for this scenario, using a function 
which assigns edge weights as the number of credits, w=C_v(t). 
Recall from above that credits accrue, or are assigned, at 
the guaranteed bandwidth rate g_v for each buffer. 

A 2x2 switch 80 is used for exemplary purposes, having 
two inputs IN_1, IN_2 and two outputs OUT_1, OUT_2. Each 
input/output pair, for example IN_1/OUT_2, has a corresponding 
queue Q_12 which buffers packets or cells arriving at the 
input IN_1 destined for the output OUT_2. In addition, each 
input/output pair has a corresponding credit buffer 
CB_11-CB_22 in which credits are tracked, and a path P_11, P_12, 
P_21 and P_22, respectively, connecting the input to the output. 
Each path P_11, P_12, P_21 and P_22 is assigned a respective 
weight w_11, w_12, w_21 and w_22 which is calculated as some 
function 81 of credits, queue length and waiting time; in this 
embodiment, the number of credits. Here w=C_v(t). 

The number of cells in each queue Q_ij is indicated by the 
respective queue lengths L_ij. Each queue Q_ij has some cells, 
that is L_ij>0, indicating that the VCs are backlogged. Credit 
buffer CB_11 has two credits (C_11=2), so according to the 
present embodiment, the corresponding weight w_11 for path 
P_11 connecting IN_1 to OUT_1 is w_11=2. Similarly, the weights 
for paths P_12, P_21, and P_22 are w_12=0, w_21=5 and w_22=3 
respectively. Since w_12=0, path P_12 is not considered part of 
the corresponding graph and is therefore drawn as a dashed 
line. 

A 32x32 switch (i.e., N=32) was used in simulations. To 
control the amount and distribution of guaranteed rates g_ij, 
two simulation parameters were used: loading factor α, and 
maximum guaranteed rate g_max. Guaranteed rates were randomly 
generated by considering each different (i,j) pair (for all 1≤i,j≤N) 
in random order. Each (i,j) pair was considered exactly once 
and the simulator generated g_ij as a uniform random variable 
between 0.01 and g_max. If the g_ij so generated (in conjunction 
with other g_ij already generated) increased the loading factor 
beyond α, then it was decreased as much as necessary to 
keep the loading factor exactly α. (Some VCs therefore 
might have g_v=0.) This method can be viewed as a very 
simple admission control, wherein VCs arrive at random and 
request a random amount of bandwidth guarantee, while the 
admission control maintains each input/output port's loading 
to α or less. 

In most simulations, this method loads every input and 
output port evenly, close to the loading factor, i.e., 
α ≈ Σ_j g_ij ≈ Σ_i g_ij. Consequently, the total throughput of the 
switch is approximately α×N. Note that although each port 
is almost uniformly loaded, this is very different from 
"uniform loading" which means each input-output pair is 
uniformly loaded, i.e., each g_ij=α/N. Our simulations in fact 
use very non-uniform loading. 

As a design choice in our simulator, a VC is not allowed 
to transmit if its credit is zero (i.e., zero-weight edges are 
dropped from the stable marriage matching), even if some 
resources (switch fabric bandwidth) are not used as a result. 
In other words, the simulated algorithms are not "work-conserving" 
in the usual sense. In real life such a choice 
would be unnecessarily wasteful. However, this choice was 
made in our simulator for two reasons. First, this represents 
a more stringent test on our algorithms. If they perform well 
in this scenario, they must perform even better in the 
non-wasteful scenario. 

Second, in some sense a VC without credit has already 
used up its reserved share of the bandwidth. Therefore, 
allowing zero-credit VCs to transmit amounts to letting them 
use unreserved bandwidth. The sharing of unreserved bandwidth 
is considered a fairness issue and is given a more 
careful treatment later. 

Nevertheless, it is reasonable to ask whether the algorithms 
of the present invention can exhibit high total 
throughput. The answer is yes. When augmented with the 
option to allow zero-credit VCs to transmit, all of the 
algorithms of the present invention exhibit about 90-100% 
total throughput. Now that the throughput question is settled, 
in all the simulations reported in the next few sections, 
zero-credit VCs are not allowed to transmit and excess 
bandwidth is simply wasted in order to create a more 
stringent test condition. 

TABLE 1 

The simulation results are shown in Table 1 for various 
values of g_max and α. For each different choice of simulation 
parameters, the experiment was run ten times. Average 
figures are shown. The quantity of interest is C_max, the 
maximum C_v(t) value achieved during the simulation, for 
all VCs and all timeslots (our simulations run for 
[illegible] timeslots). This value can be practically treated as 
a bound for C_v(t). By definition, C_v(t) is the 
VC's number of reserved transmissions up to time t, less its 
total number of transmissions (up to time t), so a bound on 
C_v(t) can be translated into the following contract: 

Any VC v will have its credit bounded C_v(t) ≤ C_max for all 
time t. In other words, at any time t, the total number of 
transmissions Σ_{τ≤t} S_v(τ) will lag behind its reserved share 
(t×g_v) by at most a small constant number of cells, equal to 
C_max. 

This contract implies the following statement of rate 
equality: The time-average transmission rate equals the 
guaranteed rate, as t→∞. In fact, the contract is a much 
stronger mathematical statement since it bounds the absolute 
difference between actual and guaranteed transmissions for 
all t. The usefulness of the contract depends entirely on the 
bound C_max. The smaller the bound, the stronger and more 
useful the contract. The practicality of the credit-weighted 
algorithm (for backlogged traffic) derives from the fact that 
the bounds are very small constants. 

Since the edge weights change by at most one every 
timeslot, the sort order can be maintained from one timeslot 
to the next with a one-pass linear updating procedure. 
Complexity is therefore O(N²). 

Theorem 3 in the appendix proves that if the loading 
factor α<½, then the contract is satisfied. In other words, this 
algorithm supports any loading pattern that does not load 
any input or output to more than 50%. Unfortunately the 
theoretically provable bound is very loose compared to 
typically observed C_max values. Thus, the theory is much 
weaker than the observed performance, which exhibits small 
bounds even at α=90%. The gap is 
most likely due to the inherent "looseness" of the Lyapunov 
proof technique, and the unavailability of combinatorial 
proof techniques for the no-speedup scenario. 

Since the bound C_max is obtained by simulation and is not 
a theoretical bound, one may have reservations about using 
such a bound in a "contract" or for VC admission control. 
However, for no-speedup scenarios, Lyapunov analysis 
often yields loose bounds. No useful combinatorial proof 
technique is known yet. Moreover, previous works which 
use Lyapunov analysis in no-speedup scenarios only bound 
expected values of queue lengths, waiting times, etc., so that 
even they are not hard bounds. Thus, a soft bound obtained 
by simulations can be considered good enough for practical 
purposes, especially if the VC/user recognizes the bound is 
obtained by simulations. 
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In addition, in today's networks there exists a large 
proportion of legacy, best-effort traffic that requires no 
bandwidth reservation. Therefore α<½ might not be an 
unrealistic assumption. In that case "stability" in the sense of 
bounded edge weights is guaranteed by theory, and the fact 
that observed bounds are much smaller than the theoretical 
bounds can be considered a fortunate bonus. 
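
The credit-weighted scheduling step described in this section can be sketched in a few lines of Python: edge weights are credit counts, zero-weight edges are dropped, and edges are examined greedily in decreasing order of weight in the manner of the central queue algorithm. This is a minimal illustration under stated assumptions, not the disclosed implementation: it sorts with Python's built-in `sorted` instead of maintaining bins, breaks ties by edge index, and keeps the new matching when the update rule finds equal totals. All function names are hypothetical.

```python
def central_queue_matching(weights, n):
    """Greedy pass over edges in decreasing weight order; returns a list of (i, j)."""
    # Zero-weight edges are dropped, as in the simulator described in the text.
    edges = sorted((e for e, w in weights.items() if w > 0),
                   key=lambda e: (-weights[e], e))
    used_inputs, used_outputs, matching = set(), set(), []
    for i, j in edges:
        if len(matching) == n:          # maximum possible size reached
            break
        if i not in used_inputs and j not in used_outputs:
            matching.append((i, j))
            used_inputs.add(i)
            used_outputs.add(j)
    return matching

def apply_update_rule(weights, new_matching, previous_matching):
    """Keep whichever matching has the larger total under the current weights."""
    def total(m):
        return sum(weights[e] for e in m)
    if total(new_matching) >= total(previous_matching):
        return new_matching
    return previous_matching

# Credit counts taken from the FIG. 8 example: C_11=2, C_12=0, C_21=5, C_22=3.
fig_8_credits = {(1, 1): 2, (1, 2): 0, (2, 1): 5, (2, 2): 3}
print(central_queue_matching(fig_8_credits, 2))
```

For the FIG. 8 weights, the greedy pass selects only the edge from IN_2 to OUT_1: the remaining positive-weight edges each share a port with it, and the zero-credit edge P_12 is not eligible.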
Credit-weighted edges — empty queue exception 

The assumption of constant large traffic backlogs may not 
be realistic, and an algorithm must be able to handle 
non-backlogged traffic as well. In non-backlogged traffic 
some queues can be temporarily empty. Such VCs have 
nothing to transmit and must be ignored. The credit-weighted 
algorithm of the previous embodiment requires a 
small and natural modification: VCs with empty queues 
(L_ij(t)=0) are ignored by giving them edge weights w_ij(t)=0 
regardless of their actual credit. VCs with non-empty queues 
still have credits as edge weights, w_ij(t)=C_ij(t), as before. 

FIG. 9 illustrates the same switch 80 as FIG. 8 but with 
the modified function 81A. Here, because the VCs are 
non-backlogged, the queues may be empty. Such is the case 
for queue Q_12. Yet this virtual circuit has accumulated C_12=7 
credits. With the weight function 81A of this embodiment, 
the weight w_12 assigned to path P_12 is zero, even though 
credits are available. Thus, path P_12 is again drawn with 
dashed lines to indicate it is not part of the path set. 
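
The empty queue exception amounts to a one-line weight function. The sketch below is illustrative; the name `credit_weight` is hypothetical, and the sample values are those given for queue Q_12 of FIG. 9.

```python
def credit_weight(credits: int, queue_length: int) -> int:
    """Edge weight with the empty queue exception: zero if nothing is queued."""
    return credits if queue_length > 0 else 0

# Q_12 in FIG. 9: seven credits accumulated, but the queue is empty.
print(credit_weight(credits=7, queue_length=0))
```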

Two kinds of non-backlogged traffic were used in simulations: 
Bernoulli traffic and 2-state traffic. These two kinds 
of traffic share several common features: different VCs are 
completely probabilistically independent; the number of 
arrivals A_v(t) is always either 0 or 1; and the average arrival 
rate λ_v is exactly the guaranteed rate g_v. We choose λ_v=g_v for 
two reasons. First, if the average arrival rate were higher, the 
VC would eventually accumulate a large backlog, as discussed 
in the previous section. On the other hand, if the 
average arrival rate were lower, the reservations would be 
larger than the actual traffic that needs to be transmitted and 
the algorithm's job would be correspondingly easier. Therefore, 
λ_v=g_v represents the most stringent test case for 
non-backlogged traffic. 

In Bernoulli traffic, for all t, Prob(A_v(t)=1)=g_v (and so 
Prob(A_v(t)=0)=1−g_v). 2-state traffic is more bursty: at each 
t the VC can be busy or idle. In busy state 
Prob(A_v(t)=1|busy)=2g_v whereas in idle state Prob(A_v(t)=1|idle)=0. In 
some simulations some g_v are allowed to be larger than ½. For 
such VCs, Prob(A_v(t)=1|busy)=1, Prob(A_v(t)=1|idle)=2g_v−1. 
This maintains the average arrival rate at g_v. State transitions 
(toggling between the busy and idle states) happen with 
probability 0.2 in each timeslot. Thus lengths of busy or idle 
periods are geometrically distributed (the discrete-time 
analogue of the exponential distribution) with an average length 
of five timeslots. 
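
The two traffic models can be sketched as simple generators of 0/1 arrival sequences. This is an illustration under assumptions not stated in the text: the initial busy/idle state is drawn with equal probability, the state toggle is evaluated after each arrival, and the function names and the use of Python's `random.Random` are choices made for this example.

```python
import random

def bernoulli_arrivals(g_v, slots, rng):
    """One arrival per timeslot with probability g_v, independently."""
    return [1 if rng.random() < g_v else 0 for _ in range(slots)]

def two_state_arrivals(g_v, slots, rng, toggle_prob=0.2):
    """Bursty arrivals: busy and idle states toggle with probability toggle_prob."""
    if g_v <= 0.5:
        p_busy, p_idle = 2 * g_v, 0.0
    else:
        p_busy, p_idle = 1.0, 2 * g_v - 1
    busy = rng.random() < 0.5            # assumed initial state distribution
    arrivals = []
    for _ in range(slots):
        p = p_busy if busy else p_idle
        arrivals.append(1 if rng.random() < p else 0)
        if rng.random() < toggle_prob:   # state transition for the next timeslot
            busy = not busy
    return arrivals

rng = random.Random(0)
sample = two_state_arrivals(0.3, 20, rng)
print(sample)
```

Because the two states are visited equally often in the long run, the average arrival rate is (p_busy + p_idle)/2, which equals g_v in both parameter regimes.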

In simulations, the credit-weighted algorithm exhibits 
much larger (and hence much less useful) bounds, as shown 
in Table 2. A closer look reveals the reason. When a VC 
becomes temporarily idle (by entering the idle state in 
2-state traffic or by chance in Bernoulli traffic), it simply 
collects credits, increasing C_v(t) as long as it stays idle, 
without limit. As long as it is idle (and ignored by the 
algorithm because w(e_v)=0), it does not actually hurt other 
VCs. However, when cells arrive at this VC, for example, 
into queue Q_12 in FIG. 9, it suddenly has a much higher edge 
weight (e.g., w_12=C_12=100) compared to others, and thus it hogs 
its input and output ports for a long time, transmitting every 
timeslot until its credit drops to a lower level comparable to 
other VCs. Meanwhile other starved non-empty VCs will 
accrue credits and their actual transmissions will lag behind 
their reserved shares by a large amount. 
TABLE 2 

Traffic type   g_max   α     C_max   LC_max 
Bernoulli      0.6     90%   338     142 
Bernoulli      0.2     90%   320     32 
2-state        0.6     90%   641     253 
2-state        0.2     90%   398     45 



A reasonable way to quantify this effect is by measuring 
the quantity 

$$\begin{aligned} LC_v(t) &= \min\bigl(L_v(t),\, C_v(t)\bigr) && (6)\\ &= \text{no. of validated or pre-paid cells} && (7) \end{aligned}$$

Intuitively, this is the number of "validated" or "pre-paid" 
cells, that is, the number of cells of v that have existing 
corresponding credits "earmarked" for them already. These 
cells are not waiting for transmission due to lack of credit. 
They already have credits and are waiting for transmission 
simply because of scheduling conflicts in the switch fabric. 
The bound LC_max in Table 2 shows the maximum value of 
LC_v(t) across all VCs and all timeslots. Both C_max and 
LC_max are relatively large, indicating that the credit-weighted 
embodiments do not perform well for 
non-backlogged traffic. 

Bucket-credit-weighted algorithm 

The credit-weighted embodiments described above let 
idle VCs collect an arbitrarily large number of credits. A 
variation employing a bucket-credit-weighted algorithm 
explicitly prevents this from happening. Each VC during 
setup time negotiates a parameter B_v called "credit bucket 
size," in addition to its guaranteed rate g_v. Whenever a VC 
has an empty queue, if its credit exceeds its bucket size, it 
no longer receives any credits. In other words, credits are 
updated as follows: 

$$C_v(t+1) = \begin{cases} C_v(t) - S_v(t) & \text{if } C_v(t) > B_v \text{ and } L_v(t) = 0 \qquad (8)\\ C_v(t) + g_v - S_v(t) & \text{otherwise} \qquad (9) \end{cases}$$

FIG. 10, which again shows the same basic 2x2 switch 80 
as FIGS. 8 and 9, illustrates this credit bucket size. Three 
credits 82 are shown "arriving" over time, at the rate of one 
credit every 1/g_v timeslots. Of course, these credits simply 
accumulate over time; they do not actually arrive from 
anywhere. However, they can be thought of as arriving for 
the sake of illustration. 

A credit bucket size B_ij has been negotiated for each VC 
and is shown alongside each credit buffer CB_11-CB_22. The 
weighting function 81B is preferably either of the functions 
81, 81A of FIGS. 8 or 9 respectively. Actual weight values 
w_ij are not shown because they depend on the function 81B. 

For illustrative purposes, a logical switch 84A-84D 
allows arriving credits to accumulate in the corresponding 
credit buffer CB_11-CB_22, respectively. However, each 
switch 84A-84D is controlled by a logic function 83 which 
implements Equations (8) and (9) above. The table below 
illustrates the four possible input conditions and the resulting 
switch state. 
Queue   C_ij   B_ij   L_ij   C_ij > B_ij   L_ij = 0   Switch 
Q_11    3      10     5      False         False      Closed (credits accrue) 
Q_12    8      5      0      True          True       Open (no more credits) 
Q_21    7      20     0      False         True       Closed 
Q_22    10     7      2      True          False      Closed 





Note that a VC with a non-empty queue, such as Q_22 of FIG. 
10, still receives g_v credits as before even if that would exceed 
its credit bucket size. After all, if a VC is busy and yet its 
credit exceeds its bucket size, the scheduling algorithm has 
probably not been serving this VC efficiently. Such a VC 
must not be further penalized by not receiving credits. 
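
The bucket rule of Equations (8) and (9) can be expressed as a small update function. The sketch below is an illustration with hypothetical function names; it uses exact fractions for g_v, and the four sample cases are the rows of the switch-state table for FIG. 10.

```python
from fractions import Fraction

def credit_switch_closed(credits, bucket_size, queue_length):
    """True if credits keep accruing, i.e., unless C > B while the queue is empty."""
    return not (credits > bucket_size and queue_length == 0)

def update_credits(credits, g_v, served, bucket_size, queue_length):
    """One timeslot of the credit update; served is S_v(t), either 0 or 1."""
    if credit_switch_closed(credits, bucket_size, queue_length):
        return credits + g_v - served      # Equation (9)
    return credits - served                # Equation (8): no new credit

# Rows of the table: (C_ij, B_ij, L_ij) for Q_11, Q_12, Q_21, Q_22.
rows = [(3, 10, 5), (8, 5, 0), (7, 20, 0), (10, 7, 2)]
states = [credit_switch_closed(c, b, l) for c, b, l in rows]
print(states)
```

Only Q_12, which is both over its bucket size and empty, stops accruing credits; Q_22 is over its bucket size but busy, so it continues to accrue.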

For simplicity, in our simulations every VC has the same 
bucket size. The algorithm obviously does not require this 
and indeed, both g_v and B_v are negotiable parameters during 
VC start-up. If a VC can negotiate a larger B_v, the scheduling 
algorithm will tolerate a higher degree of burstiness from 
this VC. 



TABLE 3 

Traffic type   g_max   α     B_v   C_max   LC_max   M_max   LCM_max 
Bernoulli      0.6     90%   40    40      38       230     170 
Bernoulli      0.6     90%   10    10      10       305     170 
Bernoulli      0.2     90%   40    40      18       210     135 
Bernoulli      0.2     90%   10    10      6.8      350     183 
2-state        0.6     90%   40    40      38       554     350 
2-state        0.6     90%   10    10      10       505     212 
2-state        0.2     90%   40    40      18       313     150 
2-state        0.2     90%   10    10      6.8      403     200 


Simulation results are shown in Table 3. Note that C_max 
bounds credits for both temporarily idle VCs and busy VCs. 
Only idle VCs have their credits bounded explicitly by bucket 
size restrictions. The table shows that busy VCs also have 
their credits bounded, thereby showing that the algorithm is 
performing well. The value of LC_max can be considered a 
credit bound for VCs which are "usually busy". The small 
bounds C_max, LC_max give rise to a useful contract: 

1. Any VC v will have its credit bounded C_v(t) ≤ C_max for 
all time t. In other words, at any time t, the total number of 
transmissions will lag behind its reserved share by at most 
C_max cells. 

2. Any VC v will have LC_v(t), its number of validated 
cells, bounded by LC_max. 

3. The above two points are only guaranteed provided that 
the VC agrees to completely forfeit, without any 
compensation, any credit it would have received while its 
queues are empty and its credit already exceeds its bucket 
size B_v. 

One way to understand the effect of bucket size is to 
understand when the VC does not need to worry about it. A 
VC need not worry about losing credits unless it is idle for 
a long time. A more precise mathematical statement is that 
if the VC is busy enough that for any time t, the total number 
of arrivals up to time t (i.e., Σ_{τ≤t} A_v(τ)) 
is at least t×g_v−B_v, then it will never lose credits due to 
bucket size restrictions. 

Another way to understand the effect of bucket size is by 
simulation and measurement. Our simulation tracks the 
number M_v(t) of credits a VC has forfeited due to bucket size 
restrictions. In Table 3, M_max shows the largest M_v(t) by any 
VC. 

Another measure of interest is 

$$LCM_v(t) = \min\bigl(L_v(t),\, C_v(t) + M_v(t)\bigr) \qquad (10)$$

This is the number of validated cells the VC would have if 
it had an infinite bucket size (and hence would never lose 
credit). From our simulations, the bounds M_max, LCM_max 
are not negligible. Thus, the bucket size has a real effect and 
any VC that agrees to such a contract with a bucket size 
provision must understand the implications. If the VC is 
known to be extremely bursty, it might need to negotiate for 
a better contract, one with a large bucket size or even no 
bucket size restriction (B_v=∞) or a higher than necessary 
reserved rate (g_v>λ_v). 

Since the edge weights change up or down by at most one 
every timeslot, the sort order can be maintained from one 
timeslot to the next with a one-pass linear updating procedure. 
Complexity is therefore O(N²). 
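
The one-pass maintenance of sort order relies on keeping edges in bins indexed by weight. The following sketch is a simplified illustration: it uses a fixed array of bins (one per integer weight up to an assumed bound `max_weight`) rather than a doubly-linked list of dynamically created bins, and it sorts within each bin only to make the output deterministic. The class and method names are hypothetical.

```python
class WeightBins:
    """Edges grouped by integer weight; bin w holds all edges of weight w."""

    def __init__(self, max_weight):
        # One bin per weight 0..max_weight; larger weights are not handled here.
        self.bins = [set() for _ in range(max_weight + 1)]
        self.weight = {}

    def set_weight(self, edge, new_weight):
        """Move an edge to the bin for its new weight in constant time."""
        old = self.weight.get(edge)
        if old is not None:
            self.bins[old].discard(edge)
        self.bins[new_weight].add(edge)
        self.weight[edge] = new_weight

    def descending(self):
        """Yield (edge, weight) from the highest-weight bin down to the lowest."""
        for w in range(len(self.bins) - 1, -1, -1):
            for edge in sorted(self.bins[w]):
                yield edge, w

bins = WeightBins(max_weight=5)
for edge, w in {(1, 1): 2, (2, 1): 5, (2, 2): 3}.items():
    bins.set_weight(edge, w)
bins.set_weight((1, 1), 3)      # weight incremented by one: moves to the next bin
print(list(bins.descending()))
```

A scheduler could feed the output of `descending()` directly to a greedy matching pass, so no general sort is needed as long as weights stay within the assumed bound.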

Since bucket size restrictions only apply to temporarily 
idle VCs, it is not clear a priori that the algorithm will bound 
credits for busy VCs. However, Theorem 4 in the appendix 
proves that if α<½ and each VC has a finite bucket size, then 
all VCs, both busy and idle, have credits bounded. This is 
true for arbitrary cell arrival patterns. Again, while the 
theory only guarantees loose bounds at α<½, simulations 
show a much better performance of small bounds at α=90%. 

Using the concept of validated cells, the validation time of 
a cell is defined as the time the cell obtains a matching credit, 
e.g., if a VC has C_v(t)>L_v(t), then the next cell to arrive will 
be validated immediately at arrival, whereas if a VC has 
C_v(t)≤L_v(t), the next cell to arrive will be validated only 
when the VC obtains a matching credit for this cell. 

Let D denote the delay between a cell's validation time and
the time when it is transmitted. Any credit bound C_max,
theoretical or experimental, provided by an embodiment of
the present invention implies a theoretical or experimental
delay bound, respectively. Any cell of VC v will have its
delay D ≤ ⌈C_max/g_v⌉. This is because if a cell is not served
within this time, another C_max credits would have arrived
which, together with the cell's matching credit, would
exceed the C_max bound. Note that this applies to the
bucket-credit-weighted algorithm as well, because as long as the
cell under consideration has not been served, the queue is
non-empty and so credit bucket restrictions do not apply.
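The delay bound above can be evaluated directly. The following is a minimal sketch, assuming the bound is the ceiling of C_max/g_v as the argument about arriving credits suggests; the function name and the sample values (C_max = 13, g_v = 1/5) are hypothetical and chosen only for illustration.

```python
import math
from fractions import Fraction


def delay_bound(c_max: int, g_v: Fraction) -> int:
    """Return ceil(C_max / g_v): timeslots needed for C_max more credits to accrue.

    Assumption: g_v is the reserved rate in cells (credits) per timeslot.
    """
    if g_v <= 0:
        raise ValueError("reserved rate g_v must be positive")
    return math.ceil(Fraction(c_max) / g_v)


# Hypothetical values: a credit bound of 13 and a reserved rate of 1/5.
print(delay_bound(13, Fraction(1, 5)))
```

Exact rational arithmetic is used so that the ceiling is not disturbed by floating-point rounding.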
LC-weighted algorithm 

Recall that the credit-weighted algorithm on
non-backlogged traffic lets idle VCs collect an arbitrarily large
number of credits. When such a VC becomes busy again, it
suddenly has a very high number of credits and hogs its
input and output ports for a long time. While the
bucket-credit algorithm discussed above is a refinement of the
credit-weighted embodiments, other preferred embodiments
take a radically different approach whereby the number of
validated cells, LC_v(t), rather than the number of credits, is
used as an edge weight. The algorithm keeps track of both
C_v(t) and L_v(t) for all VCs, and sets each edge weight to
either the buffer's length or the number of credits associated
with the buffer, whichever is less, i.e., w(e_v) = LC_v(t) =
min(L_v(t), C_v(t)).

This approach is illustrated in FIG. 11, which shows the
top half of a 2x2 switch 80. Buffer Q_11 holds six cells, so
L_11 = 6; however, there are only three credits in the credit
buffer CB_11. Therefore, only three cells 87A can be matched
to the three credits, as shown by the arrows 88A. These three
cells 87A are thus "validated".

The second queue Q_12 holds only three cells, although
there are six credits available in the credit buffer CB_12. All
three cells 87B can thus be validated, again as indicated by
arrows 88B. Thus, determining the number of validated cells
is equivalent to taking the minimum of the queue length L_ij
and the number of credits C_ij, as shown in weighting function
81C.
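The weighting function of FIG. 11 can be written out as a short sketch. The function and dictionary names are illustrative only; the cell and credit counts are the ones given in the description of the two queues.

```python
def lc_weight(queue_length: int, credits: int) -> int:
    """Number of validated cells: cells that can be paired with a credit."""
    return min(queue_length, credits)


# FIG. 11 values: the first queue has 6 cells and 3 credits,
# the second queue has 3 cells and 6 credits.
weights = {"Q11": lc_weight(6, 3), "Q12": lc_weight(3, 6)}
print(weights)
```

Both queues end up with the same edge weight, even though one is limited by credits and the other by queue length.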

In a preferred embodiment, bucket sizes are not used. 
However, because the manners in which credits are managed 
and edge weights assigned are independent issues, an alter- 
native LC-weighted embodiment does employ bucket sizes. 
In this case, the bucket is applied to outstanding credits, not 
validated credits. 



TABLE 4

Traffic type   g_max   α
Bernoulli      0.6     90%    369    13      404
Bernoulli      0.6     50%    350    4       242
Bernoulli      0.2     90%    314    7       333
2-state        0.6     90%    616    29.8    671
2-state        0.6     50%    736    6       619
2-state        0.2     90%    418    5.8     423



Table 4 shows simulation results. The bound LC max gives 
rise to the following contract: 

At any time t, any VC v will have its number of validated 
cells LC v (t), i.e., the number of cells that already have 
credits and are simply waiting due to scheduling conflicts, 
bounded by LC max . 

It might not be immediately clear what the contract 
means, however. Hence it is necessary that the meaning of 
a bound on LC v (t) be explained in more practical, customary 
and intuitive terms. 

The main observation is that since LC_v = min(L_v, C_v), if
LC_v is bounded, then at least one of L_v or C_v is bounded.
These two cases have quite different interpretations,
rephrased in the contract below:

At any time t, for any VC v, 

1. If the VC has a large queue (L_v(t) > LC_max), then its
credits must be bounded (C_v(t) ≤ LC_max). In other words, its
total number of transmissions lags behind its reserved share
of t × g_v cells by at most a small constant number of cells
LC_max. The VC is already transmitting at very close to its full
reserved rate. Such a VC can be considered to be
"overloading" since L_v(t) > C_v(t).

2. On the other hand, if the VC has a lot of credits
(C_v(t) > LC_max), then its queue size is guaranteed to be small
(L_v(t) ≤ LC_max). So, its total number of transmissions lags
behind its total number of cells (which is, of course, the
maximum number of transmissions possible) by at most a
small constant LC_max. Such a VC can be considered to be
"underloading" since L_v(t) < C_v(t).

In short, "overloading" VCs have few unspent credits, and 
"underloading" VCs have short bounded queues. Both of 
these cases represent practical, useful contracts. 

Table 4 also lists the maximum queue size L_max and
maximum credit size C_max. Even though L_max is relatively
large, the first scenario above implies these VCs are already
transmitting at full reserved speed. In addition, even though
C_max is relatively large, such VCs must have very short
queues, by the second scenario. Note that in the original
credit-weighted algorithm, such VCs are the ones that hog 
input/output ports. In the LC-weighted algorithm, however, 
they have small edge weights and do not cause any trouble 
at all. 

Since the edge weights change up or down by at most one
every timeslot, the sort order can be maintained from one
timeslot to the next with a one-pass linear updating
procedure. Complexity is therefore O(N^2).

We conjecture that if α < 1/2, then the contract is satisfied,
i.e., LC_v(t) is bounded for all VCs, both busy and idle. One
reason is that if a VC's arrival rate λ_v > g_v, then in the long
term it becomes constantly backlogged with LC_v(t) → C_v(t),
whereas if λ_v < g_v, then in the long term LC_v(t) → L_v(t) and
this becomes the scenario of Tassiulas, and McKeown,
Anantharam and Walrand. Again, simulation results exceed
the conjectured performance.
Validated-waiting-time algorithm 

As defined, the above LC-weighted algorithm suffers
from an undesirable starvation problem. Suppose a VC goes
into a prolonged non-busy period between t = T_1 and t = T_2,
and accumulates many credits during that time. The last cell
arriving just before t = T_1 will experience a long delay.
Throughout the entire period T_1 ≤ t ≤ T_2, the queue length
L_v(t) = 1 (this is the last cell before the period) and so,
although credit keeps increasing, LC_v(t) = 1. This gives the
VC very low edge weight and hence very low priority. This
starvation problem is common in most queue-length based
algorithms.
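The starvation effect can be seen in a small trace. This is a hypothetical sketch, not a simulation from the source: it assumes one cell stays queued and unserved for the whole period, and that one credit arrives every `credit_period` timeslots. The function name and parameters are illustrative.

```python
def idle_period_trace(credit_period: int, timeslots: int):
    """Trace (t, credits, LC weight) while a single cell waits unserved.

    Assumptions: the queue length stays at 1 for the whole period and one
    credit arrives whenever t is a multiple of credit_period.
    """
    queue_length = 1
    credits = 0
    trace = []
    for t in range(1, timeslots + 1):
        if t % credit_period == 0:
            credits += 1
        trace.append((t, credits, min(queue_length, credits)))
    return trace


trace = idle_period_trace(credit_period=5, timeslots=20)
print(trace[-1])
```

The credit count keeps growing, while the LC weight never exceeds one, so the waiting cell gains no priority over time.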

A preferred embodiment of the present invention fixes this
problem by keeping track of waiting times, or ages, of each
cell, and having an exception handling mechanism kick in
when the waiting time is too large, to decide whether to
select the associated buffer. In another embodiment,
"phantom" cells arrive to flush out the real cells during long idle
periods, i.e., to increment L_v(t) even though there are no true
cell arrivals.

In yet another preferred embodiment, queue lengths are
not used at all. Rather, validated waiting times associated
with the oldest cells of each buffer are explicitly used. Cells
are validated when there is a credit available, and the
validation time is recorded. The validated waiting time is
then calculated from the current time and the validation
time.

Recall that W_v(t) denotes the waiting time or delay of the
oldest input-queued cell of v, measured in units of timeslots.
By convention, a cell arriving in the current timeslot has the
minimum waiting time of one, a cell that arrived in the
previous timeslot has a waiting time of two, etc. Also by
convention, W_v(t) = 0 if the queue is actually empty. Thus, if
the oldest cell still queued arrived at time t_a, then at time t,
its waiting time is 1 + t - t_a.

Mekkittikul and McKeown, "A Starvation-free Algorithm
for Achieving 100% Throughput in an Input-Queued
Switch," ICCCN 1996, proved that if the scheduling
algorithm uses maximum weighted matchings with W_v(t) as
edge weights, then E[W_v(t)] is bounded. We have found an
appropriate generalization using only stable marriage
matchings in the context of bandwidth reservations.

This generalization is obtained by considering the "vali- 
dation time" of a cell. More precisely, recall that a queued 
cell is "validated" if there is an existing credit earmarked for 
it. The cells of a VC are assumed to be served in order of 
arrival. Any cell that arrives first must be validated first and 
also transmitted first. 

Suppose a cell c arrives at time t_a. If at that instant there
are available credits, then the arriving cell can be
immediately validated by an available credit. In this case, the
validation time of the cell c is defined to be equal to its actual
arrival time t_a. If, however, at the moment of arrival the cell
cannot be immediately validated, then its validation time is
whenever the VC accrues sufficient credit to validate it.

For instance, suppose a new cell arrives at a queue and
finds that there are two credits and ten cells ahead of it. Of
these ten cells, the oldest two are validated since there are
two existing credits. The remaining eight cells are not yet
validated. Nine additional credits must be accrued before the
new cell can be validated. The first eight of these credits go
to validate cells already in the queue, and the ninth credit
will validate the new cell. Depending on the exact arrival
time in relation to the credit stream, the validation time will
fall between t_a + 8τ_v and t_a + 9τ_v, where τ_v = 1/g_v is the
time it takes to accrue one credit.
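The counting in this example can be sketched as follows. The helper names and the arrival time of 100 with a credit period of 5 timeslots are hypothetical; only the figures of two credits and ten cells ahead come from the text.

```python
def credits_needed(cells_ahead: int, credits: int) -> int:
    """Additional credits that must accrue before a newly arrived cell is validated."""
    # Existing credits are earmarked for the oldest cells first.
    return max(0, cells_ahead + 1 - credits)


def validation_window(t_a: int, cells_ahead: int, credits: int, tau: int):
    """Earliest and latest possible validation time for a cell arriving at t_a.

    Assumption: one credit accrues every tau timeslots, at an unknown phase
    relative to the arrival.
    """
    n = credits_needed(cells_ahead, credits)
    if n == 0:
        return (t_a, t_a)  # validated immediately on arrival
    return (t_a + (n - 1) * tau, t_a + n * tau)


print(credits_needed(10, 2))
print(validation_window(100, 10, 2, 5))
```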

With this definition we now define the validated waiting
time VW_v(t) in analogy to the actual waiting time W_v(t) by
replacing the actual arrival time with the validation time.
Consider the oldest queued cell c of VC v at time t. If the VC
has an empty queue, i.e., c does not exist, or if c has not been
validated, i.e., C_v(t) = 0, then VW_v(t) = 0. Otherwise VW_v(t) =
1 + t - t_valid, where t_valid is the validation time of cell c.

The following equivalent definition is perhaps
computationally more useful (depending on the exact
implementation of time-stamping):

$$VW_v(t) = \min\bigl(W_v(t),\; T_v(t)\bigr) \qquad (11)$$

where T_v(t) is the age (actual waiting time) of the oldest
credit of the VC. This is because the oldest credit is
automatically earmarked for the oldest cell, so the validated
waiting time is the more recent (minimum) of the oldest
cell's waiting time and the oldest credit's waiting time.
Since credits arrive in a smooth stream, the quantity T_v(t)
might be easy to calculate. For example, if g_v = 1/5 then credits
arrive every 5 timeslots, in fact, at timeslots t = 5, 10, 15, and
so on. Thus, if the current time is t = 43 and C_v(t) = 3, the
oldest credit must have arrived at time t = 30 and T_v(t) = 1 +
43 - 30 = 14.
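The credit-age calculation in this example can be sketched in a few lines. The sketch assumes credits arrive exactly at multiples of the credit period and that the outstanding credits are the most recently arrived ones; the function names and the sample waiting times of 20 and 6 are hypothetical.

```python
def oldest_credit_age(t: int, credits: int, period: int) -> int:
    """Age T_v(t) of the oldest outstanding credit, or 0 if there is none.

    Assumptions: credits arrive at t = period, 2*period, ... and older
    credits are spent first, so the outstanding ones are the newest.
    """
    if credits == 0:
        return 0
    newest_arrival = (t // period) * period
    oldest_arrival = newest_arrival - (credits - 1) * period
    return 1 + t - oldest_arrival


def validated_waiting_time(w: int, credit_age: int) -> int:
    """Minimum of the oldest cell's waiting time and the oldest credit's age."""
    return min(w, credit_age)


age = oldest_credit_age(t=43, credits=3, period=5)
print(age)
print(validated_waiting_time(20, age), validated_waiting_time(6, age))
```

With no credits the age is zero, so the validated waiting time is zero as well, which agrees with the convention for unvalidated cells.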

FIG. 12 illustrates a preferred embodiment of the present
invention which uses validated waiting times VW_v(t) as
edge weights. Input circuits 102, 121 for virtual circuits
VC1 and VC2 are shown within dashed lines. The credit
buffer 103 for VC1 has six credits 107. There are three cells
111 in the queue 105, all of which have been validated, as
indicated by the arrows 109. A separate buffer 115 tracks the
validated waiting times 117 of the validated cells 111, as
indicated by arrows 113. The validated waiting time 118 of
the oldest cell, here equal to eleven timeslots, is used as the
weight w_1, as indicated by the arrow 119.

The credit buffer 123 for VC2, on the other hand, has only
three credits 125, while there are five cells in the
corresponding queue 129. Since there are only three credits 125,
only three cells 131 are validated, indicated by arrows 127.
Again, a separate buffer 135 tracks the validated waiting
times 137 of the validated cells 131, and again, the validated
waiting time 139 of the oldest cell, here equal to eight
timeslots, is used as the weight w_2, as indicated by the
arrow 140.

Note that this embodiment requires substantially more
bookkeeping than the credit- or LC-weighted algorithms,
since we must now keep track of the individual time stamps
for each cell in the queue. Simulation results are shown in
Table 5.

TABLE 5

Validated-waiting-time algorithm

Traffic type   g_max   α      C_max   VW_max (timeslots)   W_max (timeslots)
Bernoulli      0.6     90%    397     45                   1830
Bernoulli      0.6     50%    322     4.3                  1740
Bernoulli      0.2     90%    292     35                   3520
2-state        0.6     90%    739     77                   4750
2-state        0.6     80%    480     5.5                  1050
2-state        0.2     90%    389     48                   3330

VW_max is the largest VW_v(t) for all v and all t, and
again it acts as a practical "soft" bound in our contract:

At any time t, any VC v will have its validated waiting
time VW_v(t) bounded by VW_max.

When the actual waiting times W_v(t) are bounded, that
means individual cell delays are bounded. The algorithm of
the current embodiment only bounds validated waiting times
VW_v(t). What this means, in more customary and intuitive
terms, is the following:

At any time t, for any VC v, consider the oldest cell c still
in the input queues. Suppose this is the k-th cell of this
VC ever to arrive, and let t_a be its actual arrival time
(thus t_a ≤ t).

1. If cell c arrived at the same timeslot as its corresponding
credit or later (t_a ≥ k × τ_v), or equivalently if the total
number of cells arrived up to t_a (including cell c) is equal
to or less than the guaranteed share t_a × g_v,
then c will be validated at once and will have its actual delay
bounded by VW_max timeslots. (The VC is "underloading" in
this case.)

2. On the other hand, if cell c arrived before its
corresponding credit, it will have to wait (say, for a duration of T
timeslots) before validation. Its actual delay is ≤ VW_max + T, but
T is not bounded. However, in this case the actual
transmissions lag behind the accrual of credits by at most VW_max
timeslots, or equivalently, at most VW_max × g_v cells. (More
precisely, ...) So the VC is transmitting at very close to full
reserved bandwidth already. (The VC is "overloading" in this
case.)

Both of these cases represent practical, useful contracts.

Table 5 also lists the maximum actual waiting time W_max
and maximum credit size C_max. Even though W_max is
relatively large, the overloading scenario above implies
these VCs are already transmitting at full reserved speed.
Also, even though C_max is relatively large, cells of such VCs
must experience small delay according to the above
underloading scenario.

If a cell is not transmitted, its edge weight will increase by
one in the next timeslot. If a cell is transmitted, however, the
next cell's waiting time can be arbitrarily smaller, depending
on the inter-arrival duration. Thus, edge weights can change
by arbitrary amounts every timeslot. The stable marriage
matching algorithms will require a sorting pre-processing
step, and complexity is therefore O(N^2 log N).

As a final observation, we have found that validated
waiting time may be estimated based on actual waiting time
of the oldest cell, the number of credits associated with the
buffer, and the rate at which credits accrue. In particular,

$$VW'_v(t) = \min\bigl(W_v(t),\; C_v(t) \times \tau_v\bigr) \qquad (12)$$

is a reasonable approximation to VW_v(t). The reason is that
C_v(t) × τ_v ≥ T_v(t) > (C_v(t) - 1) × τ_v because credits arrive in a
smooth stream. So comparing the two equations, VW'_v(t)
overestimates VW_v(t), but their difference is at most τ_v and
usually much less. For slightly more accuracy, VW''_v(t) =
min(W_v(t), (C_v(t) - 1/2) × τ_v) can be used.
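The two estimates can be compared numerically. This is a minimal sketch, assuming the first estimate uses C_v(t) × τ_v and the second uses (C_v(t) - 1/2) × τ_v as the credit-side term; clamping the second term at zero for a VC without credits is an added assumption, and the sample values W = 20, C = 3, τ = 5 are hypothetical.

```python
from fractions import Fraction


def vw_estimates(w: int, credits: int, tau: int):
    """Return the coarse and the half-period-corrected estimates of VW_v(t).

    Assumption: the corrected credit term is clamped at 0 so that a VC with
    no credits does not get a negative estimate.
    """
    coarse = min(w, credits * tau)
    corrected = min(w, max(Fraction(0), (credits - Fraction(1, 2)) * tau))
    return coarse, corrected


coarse, corrected = vw_estimates(w=20, credits=3, tau=5)
print(coarse, corrected)
```

For a credit stream with period 5 and three outstanding credits, the true oldest-credit age lies between 11 and 15 timeslots, so both estimates fall within one credit period of the exact value.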
Waiting time in units of "tolerable delay"

The validated-waiting-time algorithm bounds every VC's
validated waiting time by the same bound VW_max, regardless
of their guaranteed rates or their tolerance for delay. In some
applications, however, different VCs may have different
tolerance for delay. In that case, it may be more useful to
give a bound/contract of the following form:

At any time t, for any VC v, the validated waiting time of v
is bounded by K × D_v for some small constant K.

Every VC's validated waiting time is still bounded, but
the actual bound is a constant multiple K of each VC's delay
tolerance parameter D_v, and thus it is different for different
VCs. Thus, the validated waiting time is scaled by a constant
1/(K × D_v) which is inversely proportional to the predetermined
tolerable delay D_v. The delay tolerance D_v is yet another
parameter that can be negotiated during start-up. VCs with
stringent delay requirements must try to obtain a small D_v.
This decouples the rate guarantee from the delay guarantee.

The validated-waiting-time algorithm is easily modified
for this feature by substituting VW_v(t)/D_v for VW_v(t) as edge
weights. In simulations, the algorithm is still observed to
bound the new edge weights to a constant, satisfying the
contract. The size of the bound, and hence the usefulness of
the contract, depends on the relative sizes of g_v, D_v, and the
product g_v × D_v for different VCs. The exact dependence is
not completely understood yet.

In a preferred embodiment, D_v = τ_v = 1/g_v, i.e., the tolerable
delay is the inverse of the guaranteed bandwidth. In other
words, the network management does not negotiate rate and
delay guarantees separately but instead mandates that slower
VCs, i.e., those with small g_v, must tolerate proportionally
larger delay. In this case the validated-waiting-time
algorithm's performance is similar to that of the LC-weighted
algorithm. This is not surprising, since measuring delay in
multiples of credit periods τ_v should be similar to measuring
the number of outstanding credits, because credits arrive in a
smooth stream. However, using validated waiting time in
units of τ_v has an advantage over using LC_v(t): the former
does not suffer from the starvation problem discussed
previously.

Fair Sharing of Unreserved Switch Capacity 

By design, the bandwidth reservation algorithms dis- 
cussed thus far only serve a VC when it can pay the required 
credit. Since reserved bandwidth usually does not make up 
100% of network capacity, the algorithms are not "work- 
conserving" and lead to under-utilization. How then should 
the unreserved capacity of the network be used? Various 
embodiments, corresponding to step 209 of FIG. 3, are now 
presented which achieve near-maximum utilization and fair 
sharing of the unreserved capacity, by selecting buffers 
according to a second matching between remaining inputs 
and outputs. 

The notion of max-min fairness is applied to the
unreserved capacity of the network resources as described by
Dimitri Bertsekas and Robert Gallager, Data Networks, 2nd ed.,
published by Prentice Hall, 1992, p. 524. The resources
required to support bandwidth reservations are exempted
from fairness considerations, but the leftover resources must
be shared fairly by all VCs. Max-min fairness is a rate-based
notion. The term "excess rate" is used to denote a VC's
transmission rate in excess of its guaranteed rate, if any.

A set of VC excess rates, measured in cells/timeslot, is
"max-min fair" if and only if every VC has at least one
bottleneck resource. A resource is a bottleneck resource for
a particular VC if (a) that resource is fully utilized, and (b)
that VC has at least as high an excess rate as any other VC
using that resource.
As an example, Tables 6-1 through 6-3 show five VCs in
an N=3 switch. Each VC has a different source-destination
combination, corresponding to its position in each matrix.
All numbers are transmission rates in cells/timeslot. Table
6-1 shows the guaranteed rates granted to the VCs. Two of
the five VCs have g_v = 0 and they represent best-effort traffic.
Input 1, which can support a maximum rate of one
cell/timeslot, must use 0.5 of that capacity to support the
guaranteed transmissions of the two VCs to Outputs 1 and
2, and therefore only has an excess rate of 0.5 cells/timeslot
available for fair sharing. Similarly, output line 2 (column 2)
must use 0.4 of its rate of one cell/timeslot to support
guaranteed traffic, leaving only 0.6 for sharing.

Using these excess rates, the max-min fair shares of the
excess rates are shown in Table 6-2. The VCs in the 2nd
column have an output line bottleneck and are limited to an
excess rate of 0.6/3=0.2 each, while the other two VCs are
limited by their respective input lines as bottlenecks. The
total rate of each VC is its guaranteed rate plus its fair share
of excess bandwidth, shown in Table 6-3.
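Max-min fair shares of this kind can be computed offline by progressive filling, a standard technique that is swapped in here for illustration and is not stated in the source. The sketch below raises all unfrozen excess rates equally until some port saturates, then freezes the VCs using that port. The VC positions and most excess capacities are hypothetical; only the 0.5 excess at input 1 and the 0.6 excess at output 2 are taken from the example.

```python
from fractions import Fraction


def max_min_fair(vcs, in_excess, out_excess):
    """Progressive filling over input and output ports.

    vcs:        list of (input, output) pairs, one per VC.
    in_excess:  dict input port  -> excess capacity (Fraction).
    out_excess: dict output port -> excess capacity (Fraction).
    """
    remaining = {("in", p): Fraction(c) for p, c in in_excess.items()}
    remaining.update({("out", p): Fraction(c) for p, c in out_excess.items()})
    rate = {vc: Fraction(0) for vc in vcs}
    active = set(vcs)
    while active:
        # Count how many still-growing VCs use each port.
        users = {}
        for i, o in active:
            users[("in", i)] = users.get(("in", i), 0) + 1
            users[("out", o)] = users.get(("out", o), 0) + 1
        # Largest equal increment that no port's remaining capacity forbids.
        step = min(remaining[r] / n for r, n in users.items())
        for vc in active:
            rate[vc] += step
        for r, n in users.items():
            remaining[r] -= step * n
        saturated = {r for r in users if remaining[r] == 0}
        active = {(i, o) for i, o in active
                  if ("in", i) not in saturated and ("out", o) not in saturated}
    return rate


# Hypothetical layout: three VCs share output 2, two VCs share input 1.
vcs = [(1, 1), (1, 2), (2, 2), (3, 2), (3, 3)]
in_excess = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(1)}
out_excess = {1: Fraction(1), 2: Fraction(3, 5), 3: Fraction(1)}
shares = max_min_fair(vcs, in_excess, out_excess)
print(shares)
```

In this layout the three VCs on output 2 each receive 0.2, and the remaining two VCs are limited by their input lines, matching the pattern described for the example.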
Two Phase Usage Weighted Algorithm

FIG. 13 illustrates yet another preferred embodiment
which employs an algorithm that operates in two phases in
each timeslot, where the weights in the second phase are
based on usage. Input circuits 202 and 203 corresponding
to two VCs, VC_1 and VC_2 respectively, are shown within a
switch 201. In a first phase, any of the previously described
bandwidth reservation algorithms, represented by boxes 204
and 205, is used to calculate a weight w_1, w_2, etc. for each
VC. The weights w_1, w_2, etc. are presented to a first phase
scheduler 209, which produces a matching X 213 as
previously described. The VCs included in the matching X have
their credits decremented as usual.

Now, the matching X is presented to a second phase
scheduler 211. If |X| < N, i.e., if additional transmissions are
possible, additional VCs are chosen during a second phase
by the phase two scheduler 211, to fill up the transmission
schedule. These VCs have "usage" variables U_v which,
initially set to zero, are incremented by one under control
219 of the phase two scheduler 211. Thus each U_v counts the
number of cells a VC has sent but for which no credits have
been paid, i.e., the number of "excess" or unpaid
transmissions. By design, the greedy reservation algorithms of the
present invention never miss a busy VC which has a credit
to spend. Therefore, all of the VCs chosen during the second
phase have no credit, i.e., C_v = 0.

The idea is that to be fair, VCs which have few previous 
excess transmissions (small U v ) should be considered first in 
the sharing of excess resources. The second phase Usage 
Weighted Algorithm implements this directly as follows. 
Each VC is considered in increasing U v order, and is added 
to the matching X if possible. Otherwise, the VC is skipped. 
There is no backtracking. This is identical to the central 
queue algorithm except that the weights U v are sorted in 
increasing order, and the initial matching X 213 is computed 
in the first phase. The resulting matching Y 215 thus 
specifies transmissions across the switch fabric 214 for the 
current timeslot. 
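The second phase can be sketched as a greedy pass over the candidates. This is an illustrative sketch under stated assumptions: each VC is identified by its (input, output) pair, the candidate list contains only busy VCs not already matched, and the port numbers and usage counts are hypothetical.

```python
def second_phase(first_matching, candidates, usage):
    """Extend a first-phase matching greedily in increasing-usage order.

    first_matching: list of (input, output) pairs chosen in phase one.
    candidates:     (input, output) pairs of busy VCs not yet matched.
    usage:          dict mapping each candidate to its usage count U_v;
                    updated in place for every VC added here.
    """
    used_inputs = {i for i, _ in first_matching}
    used_outputs = {o for _, o in first_matching}
    matching = list(first_matching)
    for vc in sorted(candidates, key=lambda c: usage[c]):
        i, o = vc
        if i in used_inputs or o in used_outputs:
            continue  # port conflict: skip, no backtracking
        matching.append(vc)
        used_inputs.add(i)
        used_outputs.add(o)
        usage[vc] += 1  # one more unpaid ("excess") transmission
    return matching


# Hypothetical 3x3 timeslot: phase one matched input 0 to output 1.
usage = {(1, 0): 5, (1, 2): 2, (2, 0): 1, (0, 2): 0}
final = second_phase([(0, 1)], list(usage), usage)
print(final, usage)
```

The VC with the lowest usage is skipped here because its input is already taken, which shows that low usage gives priority in the ordering but not a guarantee of service.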

Usage-Credit Weighted Algorithm 

FIG. 14A illustrates still another preferred embodiment
241, in which the (bucket-)credit weighted algorithm and the
usage-weighted algorithm are combined into a single
usage-credit weighted algorithm. In this algorithm, portrayed as a
circuit 242, all VCs are sorted by the difference UC_v = U_v - C_v.
Credits 243 are maintained as usual by incrementing C_v at
regular intervals of τ_v timeslots via control 244, and
decrementing it via control 245 when a cell is validated, i.e., the
credit is paid for. Usage variable U_v 246 is maintained as
described above by incrementing it via control 247. At 248,
C_v is subtracted from U_v, effectively applying credits
"retroactively" to previous excess transmissions, which are now
accounted for as guaranteed transmissions. The resulting
edge weights UC_v 249 measure the number of excess
transmissions under this accounting scheme.

A UC-scheduler 250 considers all VCs in increasing UC_v
order in a single pass, dealing with bandwidth reservations
and fair sharing together. When a VC transmits a cell,
its C_v is decremented if C_v > 0 originally, otherwise its U_v is
incremented. In either case, its UC_v will therefore be
increased by one in the next timeslot. The output of the
scheduler 250 is a matching X 251 which is applied to the
switch fabric 253.

FIG. 14B illustrates a practical optimization in which the 
algorithm tracks just one variable, UC V 233, per VC. This 
variable 233 is incremented via control 234 when a cell is 
sent, and decreased via control 235 when credits are 
received. 
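The single-variable optimization can be checked against the two-variable bookkeeping. The class below is a hypothetical sketch (names and event sequence are illustrative, and no bucket size is applied): it keeps C_v, U_v and the combined UC_v side by side so that the invariant UC_v = U_v - C_v can be observed.

```python
class VcAccount:
    """Tracks credits C_v and usage U_v, plus the single combined counter UC_v."""

    def __init__(self):
        self.credits = 0
        self.usage = 0
        self.uc = 0  # the only variable the optimized scheme needs

    def credit_arrives(self):
        self.credits += 1
        self.uc -= 1

    def cell_sent(self):
        # Pay with a credit if one exists, otherwise count an excess transmission.
        if self.credits > 0:
            self.credits -= 1
        else:
            self.usage += 1
        self.uc += 1  # either way, UC_v grows by one


vc = VcAccount()
for event in ["send", "credit", "credit", "send", "send", "send"]:
    if event == "send":
        vc.cell_sent()
    else:
        vc.credit_arrives()
    assert vc.uc == vc.usage - vc.credits  # invariant holds after every event
print(vc.credits, vc.usage, vc.uc)
```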

If bucket sizes are used, credits C_v must still be tracked
separately. This is illustrated in FIG. 14C in a sample
schematic. Here, the number of credits C_v 275 for a given
VC v is compared at comparator 286 with the bucket size B_v
283. The output of the comparator 286 is ORed at OR gate
285 with a signal 281 which indicates that the corresponding
queue is empty (L_v = 0), and which is inverted at the input to
the OR gate 285. The output 293 of OR gate 285 is ANDed with
the normal credit increment control 279 to act as a gate to
allow or disallow accruing of additional credits. As with the
embodiment of FIG. 14A, the resulting value of C_v 275 is
subtracted from the usage count U_v 251 to form a
usage-credit weight 295.
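The gating logic of FIG. 14C can be expressed as a boolean function. This is a sketch under one explicit assumption: the comparator is taken to assert its output when C_v is below B_v, a polarity the text does not state. The function and parameter names are illustrative.

```python
def credit_increment_enabled(credits: int, bucket_size: int,
                             queue_length: int, increment_tick: bool) -> bool:
    """Decide whether a scheduled credit increment is allowed through."""
    below_bucket = credits < bucket_size    # comparator 286 (polarity assumed)
    queue_nonempty = queue_length != 0      # inverted "queue empty" signal 281
    gate = below_bucket or queue_nonempty   # OR gate 285
    return increment_tick and gate          # AND with increment control 279


# An idle VC with a full bucket forfeits the credit; a busy VC does not.
print(credit_increment_enabled(4, 4, 0, True))
print(credit_increment_enabled(4, 4, 2, True))
```

Under this reading the bucket only limits credits while the queue is empty, which matches the statement that bucket restrictions do not apply to a non-empty queue.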

Since VCs are considered in increasing UC_v order, those
VCs with negative UC_v are considered first. For these VCs,
-UC_v = C_v - U_v is positive and represents the amount of
unspent credits. Thus the first part of the usage-credit
weighted algorithm, using the central queue algorithm to
choose VCs in increasing UC_v order with most negative UC_v
first, is equivalent to the (bucket-)credit weighted algorithm,
using the central queue algorithm to choose VCs in
decreasing C_v order.

Table 7 shows the performance of the UC-weighted
algorithm in simulations. The total number of VCs shown in
the table includes those with a non-zero bandwidth
guarantee, and those with no bandwidth guarantee. The
latter represent best-effort traffic. All VCs have random
input and output ports, both chosen uniformly among the
N=32 ports. The generation of each VC's guaranteed rate is
done as before, subject to the same simple "admission
control" of not loading any input or output beyond α.
Backlogged traffic represents an overloading scenario where
all VCs are constantly backlogged. When Bernoulli traffic is
used, each VC's arrival rate equals its guaranteed rate plus
a small constant. In each test case the small constant is
adjusted so that the total arrival rate of all VCs equals N
cells/timeslot, the highest possible throughput of the switch.
This represents an exact loading scenario.

Table 7 shows that total switch throughput is usually very
high. (For Bernoulli traffic, the throughput is affected by the
arrival processes and therefore not very meaningful.) The
algorithm's performance regarding fairness is measured by
the parameter δ_v, defined as the ratio of a VC's excess
transmission rate over its fair excess rate (computed offline).
A VC which gets less than its fair share will have δ_v < 1, and
a VC which gets more than its fair share will have δ_v > 1. The
table shows the distribution of all δ_v values and also the
minimum value. It shows that many VCs (at least 85% of
them) obtain at least 95% of their fair shares. However, a
small fraction of VCs might be treated very unfairly (small
δ_v) under some settings. The simulation results are similar
for the two phase algorithm. In practice, the one-phase
usage-credit weighted algorithm may be preferable to the
two phase algorithm because of its simpler implementation
and resulting faster running time.



TABLE 7

                       no. of VCs  total          min      % of VCs with δ_v in these ranges     Total
                       with non-   no. of         δ_v      min(δ_v)  0.7 to   0.85 to  0.95      switch
Traffic type   g_max   zero GBW    VCs      α     value    to 0.7    0.85     0.95     or more   throughput

TABLE 7-continued

backlogged     0.2     201         2048     50%   0.161    0.4%      1.0%     1.7%     96.9%     99.0%
Bernoulli      0.2     204         1024     50%   0.672    0.04%     0.4%     1.5%     98.1%
Bernoulli      0.2     201         2048     50%   0.614    0.1%      0.8%     1.6%     97.5%



Multiple VCs per input-output pair 

In practice it is likely that several VCs have the same
input-output pair, as illustrated in FIG. 15A. Here, three
VCs, VC_1-VC_3, all enter the switch 301 at input IN_1 and
exit through OUT_1. Similarly, VC_4 and VC_5 share input IN_1
and output OUT_2. Lastly, VC 2 is unique in that it does not
share an input/output pair with any other VC.

In this case each VC has its own guaranteed rate.
Moreover, providing rate and delay guarantees to each
individual VC means that per-VC queuing is required.

One might be tempted to let the scheduling algorithm 
lump all such VCs together as one "super" VC with a 
combined guaranteed rate. However, among these VCs, the 
scheduling algorithm must then devise some separate round- 
robin or priority scheme to ensure each VC obtains its own 
rate and/or delay guarantee. 

FIG. 15B illustrates a preferred embodiment of the
present invention in which the scheduling algorithm keeps
track of each VC, e.g., VC_1, VC_2, VC_3, etc., separately, with
separate C_v, L_v, LC_v, W_v, VW_v, U_v and whatever other
parameters the scheduling algorithm requires, within the
corresponding input circuits 309A-309C respectively. Each
circuit 309A-309C produces a corresponding weight w_1-w_3
respectively. Conceptually, the bipartite graph becomes a
multi-graph where there may be different edges, each
representing a different VC, connected to the same two
nodes/ports. However, since the reservation algorithms of the
present invention are greedy and are only interested in
choosing high edge weights, a simple preprocessing 307 can
trim the multi-graph into a normal graph, preferably by
choosing the edge of highest weight between any
input-output pair, resulting in, for example, w_11 of FIG. 15B, which
is then fed to the scheduler 303, as usual, to produce a
matching X 304 which is applied to the switch fabric 305.

Similarly, the fairness algorithms of the present invention 
can trim the multi-graph by keeping only the edge of lowest 
weight between any input-output pair. 
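The trimming step for both cases can be sketched with one helper. The tuple layout, VC names and weights below are hypothetical; the `pick` parameter selects the highest weight for the reservation algorithms and the lowest weight for the fairness algorithms.

```python
def trim_multigraph(vc_edges, pick=max):
    """Keep one VC per (input, output) pair.

    vc_edges: iterable of (input_port, output_port, vc_id, weight).
    pick:     max for reservation weights, min for fairness (usage) weights.
    Ties keep the VC that appears first.
    """
    best = {}
    for inp, out, vc, w in vc_edges:
        current = best.get((inp, out))
        if current is None or pick(w, current[1]) != current[1]:
            best[(inp, out)] = (vc, w)
    return best


# Hypothetical weights for five VCs on two input-output pairs.
edges = [(1, 1, "VC1", 4), (1, 1, "VC2", 9), (1, 1, "VC3", 2),
         (1, 2, "VC4", 5), (1, 2, "VC5", 7)]
print(trim_multigraph(edges, pick=max))
print(trim_multigraph(edges, pick=min))
```

The trimmed dictionary has at most one edge per port pair, so it can be handed to a scheduler that expects an ordinary bipartite graph.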

While this invention has been particularly shown and
described with references to preferred embodiments thereof,
it will be understood by those skilled in the art that various
changes in form and details may be made therein without
departing from the spirit and scope of the invention as
defined by the appended claims.

For example, the same principle of choosing appropriate 
edge weights and bounding them might be more widely 
applicable to achieve other kinds of QoS contracts, e.g., 
delay variation guarantees, fairness based on waiting times 
(unlike rate-based max-min fairness), etc. 

APPENDIX 

Proofs of Theorems 
Theorem 1 (Stable marriage matchings have half the maximum
weight) Given a weighted bipartite graph (U, V, E, w)
with non-negative weights, a stable marriage matching X,
and any other matching Y, the following inequality holds:

$$W(X) \ge W(X \cap Y) + \tfrac{1}{2}\, W(Y - X \cap Y)$$

where W() denotes the weight of a matching (i.e., the sum of its
edge weights), and the ∩ notation denotes set intersection,
i.e., X∩Y denotes all edges in both X and Y, and Y - X∩Y
denotes all edges in Y but not in X.

In particular, since all edge weights are non-negative, we
have:

$$W(X) \ge W(X \cap Y) + \tfrac{1}{2}\, W(Y - X \cap Y) \ge \tfrac{1}{2}\, W(X \cap Y) + \tfrac{1}{2}\, W(Y - X \cap Y) = \tfrac{1}{2}\, W(Y).$$

Further, take Y to be a maximum weighted matching, and this
theorem implies that any stable marriage matching has at
least 1/2 the maximum weight.

Proof of theorem 1: Let X be a stable marriage matching and
Y be another matching. Let $Z = X \cap Y$ (edges in both X and
Y), $\hat{X} = X - Z$ (edges in X but not in Y), and
$\hat{Y} = Y - Z$ (edges in Y but not in X). We will prove the
following equivalent statement:

$$\tfrac{1}{2} W(\hat{Y}) \le W(\hat{X}).$$

The theorem will then follow by adding W(Z) to both sides
and noting that $W(Z) + W(\hat{X}) = W(X)$.

Since X is a stable marriage matching, every edge
$e_{\hat{Y}} \in \hat{Y}$ has a higher-or-equal-weight blocking
edge in X, denoted by $\mathrm{block}(e_{\hat{Y}})$. Note: In case
$e_{\hat{Y}}$ has two higher-or-equal-weight blocking edges, we
can assume (without loss of generality) that each edge has a
numeric unique identifier (assigned arbitrarily, e.g., VC ID)
and let $\mathrm{block}(e_{\hat{Y}})$ denote the one with a
smaller unique identifier.

Here are some simple properties of edges $e_{\hat{Y}} \in \hat{Y}$
and their blocking edges:

1. $\mathrm{block}(e_{\hat{Y}}) \in X$ and
$w(\mathrm{block}(e_{\hat{Y}})) \ge w(e_{\hat{Y}})$, by definition.

2. $\mathrm{block}(e_{\hat{Y}}) \notin Y$, because Y, being a
matching, cannot contain both $e_{\hat{Y}}$ and its blocking
edge. Combining $\mathrm{block}(e_{\hat{Y}}) \in X$ and
$\mathrm{block}(e_{\hat{Y}}) \notin Y$, we have
$\mathrm{block}(e_{\hat{Y}}) \in X - X \cap Y = \hat{X}$.

3. Any $e_{\hat{X}} = (u, v) \in \hat{X}$ can only block at most
two different edges in $\hat{Y}$. This is because $\hat{Y}$ is a
matching and contains at most one edge connecting to u and at
most one edge connecting to v.

Now let the edges of $\hat{Y}$ be explicitly listed as
$\{e_1, e_2, \ldots, e_k\}$. We have:

$$w(e_1) \le w(\mathrm{block}(e_1))$$
$$w(e_2) \le w(\mathrm{block}(e_2))$$
$$\vdots$$
$$w(e_k) \le w(\mathrm{block}(e_k))$$

Summing up all equations, the sum of the left sides
$= w(e_1) + \ldots + w(e_k) = W(\hat{Y})$. On the right sides,
every $\mathrm{block}(e_i) \in \hat{X}$ and any edge in $\hat{X}$
can appear at most twice, thus the sum of the right sides
$\le 2 \times W(\hat{X})$. (Note that this uses the assumption
that edge weights are non-negative.) This proves the required
$W(\hat{Y}) \le 2 W(\hat{X})$.
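The inequality of Theorem 1 can be checked numerically on small graphs. The sketch below is an illustration only, under stated assumptions: it builds X with a greedy descending-weight pass (which satisfies the blocking-edge property used in the proof), draws random non-negative integer weights on a complete N x N graph, and compares X against every perfect matching Y. The function names are hypothetical, and restricting Y to perfect matchings is a simplification made here.

```python
import itertools
import random


def greedy_stable_matching(w):
    """Pick edges in descending weight order, skipping edges that conflict."""
    n = len(w)
    edges = sorted(((w[i][j], i, j) for i in range(n) for j in range(n)), reverse=True)
    used_in, used_out, matching = set(), set(), set()
    for _, i, j in edges:
        if i not in used_in and j not in used_out:
            matching.add((i, j))
            used_in.add(i)
            used_out.add(j)
    return matching


def weight(w, matching):
    """W(.) of the theorem: the sum of the edge weights of a matching."""
    return sum(w[i][j] for i, j in matching)


def check_theorem1(n=4, trials=200, seed=0):
    """Return True if W(X) >= W(X∩Y) + W(Y - X∩Y)/2 held in every trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        w = [[rng.randint(0, 9) for _ in range(n)] for _ in range(n)]
        x = greedy_stable_matching(w)
        for perm in itertools.permutations(range(n)):
            y = {(i, perm[i]) for i in range(n)}
            common = x & y
            # Inequality multiplied by 2 so that the comparison stays in integers.
            if 2 * weight(w, x) < 2 * weight(w, common) + weight(w, y - common):
                return False
    return True


theorem1_holds = check_theorem1()
```

Because one of the permutations is always a maximum weighted matching of the complete graph, the same loop also exercises the "at least half the maximum weight" corollary.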
Theorem 2 (Correctness of Central Queue algorithm)

When the Central Queue algorithm terminates, M is a
maximal weighted matching.

Proof: Let $M_{final}$ denote the value of M when the algorithm
terminates. Now consider an edge $e' \notin M_{final}$. There are
two cases, which together directly satisfy the definition of
stable marriage matchings:

1. The algorithm terminates before e' is considered. This
can only happen when $|M_{final}| = N$, so that there is some
blocking edge $e \in M_{final}$ that shares a common node
with e'. Because of the sort order, we have $w(e) \ge w(e')$.

2. The algorithm has considered e' at some point. Suppose
that when e' is considered, the matching is $M_1 \subseteq M_{final}$.
By design, the only possible reason why e' is not added
is that $M_1 \cup \{e'\}$ is not a matching, or equivalently,
there exists $e \in M_1 \subseteq M_{final}$ such that e', e share a
common node. However, $e \in M_1$ means that e has
already been considered at that point, i.e., $w(e) \ge w(e')$
because of the sort order. Therefore e' has a blocking
edge $e \in M_{final}$ with higher or equal weight.
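The two cases of the proof can be traced on a small example. The sketch below is a simplified stand-in for the Central Queue algorithm, assuming only what the proof uses: edges are visited in descending weight order, an edge is added when it conflicts with nothing already chosen, and the pass stops once N edges are matched. The names `central_queue` and `is_stable`, and the `(weight, input, output)` tuple layout, are hypothetical.

```python
def central_queue(edges, n_ports):
    """Greedy pass over edges sorted by descending weight.

    edges: iterable of distinct (weight, input, output) tuples with positive weight.
    Stops early once n_ports edges are matched (case 1 of the proof).
    """
    matching, busy_in, busy_out = [], set(), set()
    for w, i, o in sorted(edges, key=lambda e: e[0], reverse=True):
        if len(matching) == n_ports:
            break
        if i not in busy_in and o not in busy_out:
            matching.append((w, i, o))
            busy_in.add(i)
            busy_out.add(o)
    return matching


def is_stable(edges, matching):
    """Check that every unmatched edge has a blocking edge of >= weight."""
    chosen = set(matching)
    for e in edges:
        if e in chosen:
            continue
        w, i, o = e
        if not any(mw >= w and (mi == i or mo == o) for mw, mi, mo in matching):
            return False
    return True


edges = [(5, 0, 0), (4, 0, 1), (3, 1, 0), (1, 1, 1)]
matching = central_queue(edges, n_ports=2)
```

Here the greedy result has weight 6, while the maximum weighted matching {(4, 0, 1), (3, 1, 0)} has weight 7, which is consistent with the factor of one half in Theorem 1.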

Theorem 3 (Credit-weighted algorithm supports 50% reservations
with backlogged traffic)

If $\alpha < \tfrac{1}{2}$ and all VCs are constantly backlogged,
then the credit-weighted algorithm guarantees that there is a
bound $C_{max}$ such that, at any time t, for any VC f, the VC's
credit $C_f(t) \le C_{max}$. This result holds whether the edge
weights are fractional credits or an integral approximation,
i.e., $\lfloor C_f(t) \rfloor$.
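The credit dynamics behind Theorem 3 can be explored with a short simulation. The sketch below is an assumption-laden illustration, not the patented scheduler: it uses one backlogged VC per input-output pair, equal rates g_f whose row and column sums equal a reservation factor of 0.4, a greedy descending-weight matching over strictly positive credits, and the update C_f(t+1) = C_f(t) + g_f - S_f(t). All function and variable names are hypothetical.

```python
def greedy_matching(weights):
    """Greedy (descending weight) matching over strictly positive weights."""
    n = len(weights)
    edges = sorted(
        ((weights[i][j], i, j) for i in range(n) for j in range(n) if weights[i][j] > 0),
        reverse=True,
    )
    used_in, used_out, chosen = set(), set(), []
    for _, i, j in edges:
        if i not in used_in and j not in used_out:
            chosen.append((i, j))
            used_in.add(i)
            used_out.add(j)
    return chosen


def simulate_credits(rates, timeslots=2000):
    """Run C_f(t+1) = C_f(t) + g_f - S_f(t); return (min, max) credit observed."""
    n = len(rates)
    credits = [[0.0] * n for _ in range(n)]
    lo = hi = 0.0
    for _ in range(timeslots):
        served = set(greedy_matching(credits))  # S_f(t), chosen from weights C_f(t)
        for i in range(n):
            for j in range(n):
                credits[i][j] += rates[i][j] - (1.0 if (i, j) in served else 0.0)
                lo = min(lo, credits[i][j])
                hi = max(hi, credits[i][j])
    return lo, hi


ALPHA = 0.4  # assumed reservation factor, below 1/2
N = 3        # assumed switch size
rates = [[ALPHA / N] * N for _ in range(N)]
lowest, highest = simulate_credits(rates)
```

In such a run the credits stay within a narrow band: they never drop below -1 (only positive credits are decremented, and by at most 1), and the largest credit does not grow with the number of timeslots.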
Proof: Assume all VCs are constantly backlogged,
$\alpha < \tfrac{1}{2}$, and the algorithm is used as described.
We will prove that the quantity $V(t) = \sum_f C_f(t)^2$ is
bounded, which would imply all $C_f(t)$ are bounded. This proof
is adapted from [1, 3], which (unlike this work) deal with
maximum weighted matchings.

Let $S_f(t)$ denote the number of cells transmitted by VC f
at time t. Then $S_f(t) = 0$ or 1, and the set of zero-one values
$\{S_f(t)\}$ specifies the maximal weighted matching chosen
at time t. Moreover, $\sum_f [C_f(t) S_f(t)]$ is the total weight
of this matching. Note that since the algorithm ignores VCs with
$C_f(t) \le 0$, those VCs will automatically have $S_f(t) = 0$.
(This also ensures that no $C_f(t)$ will drop below -1, since only
positive $C_f(t)$ are decremented, and they decrease at most by
1.) Similarly, if several VCs have the same source-destination
pair, only the one with highest credit is considered, and the
others will have $S_f(t) = 0$.

Let $\{S^*_f(t)\}$ be another set of zero-one values that specify
a maximum weighted matching at time t. Then theorem 1 states
that

$$\sum_f [C_f(t) S_f(t)] \ge \tfrac{1}{2} \sum_f [C_f(t) S^*_f(t)].$$

We have:

$$C_f(t+1) = C_f(t) + g_f - S_f(t) \tag{13}$$
$$V(t+1) - V(t) = \sum_f \left[ C_f(t+1)^2 - C_f(t)^2 \right] \tag{14}$$
$$= \sum_f \left[ 2 C_f(t) (g_f - S_f(t)) + (g_f - S_f(t))^2 \right] \tag{15}$$
$$\le 2 \sum_f [C_f(t) (g_f - S_f(t))] + K_1 \tag{16}$$

The set $\{S_f(t)\}$ specifies the chosen matching. The following
lemma now relates $g_f$ and $\alpha$ to matchings:

Lemma 1 (Convex Combination) Given that the $g_f$ values
correspond to a certain reservation factor $\alpha$, there exist K
sets of zero-one values $\{S^1_f\}, \{S^2_f\}, \ldots, \{S^K_f\}$
such that:

1. Each set $\{S^k_f\}$ (k = 1, ..., K) corresponds to a
matching. In other words, within each set, at most one
VC f with source u has $S^k_f = 1$ (for any $u \in U$) and at
most one VC with destination v has $S^k_f = 1$ (for any
$v \in V$).

2. The set $\{g_f\}$ can be written as a convex combination of
these K sets with total weight at most $\alpha$, i.e., there are
coefficients $\beta_1, \ldots, \beta_K$ such that
$g_f = \sum_k \beta_k S^k_f$ for any f, and where each coefficient
$\beta_k > 0$ and their sum
$\beta_{sum} = \sum_k \beta_k \le \alpha$.

Using this lemma we obtain:

$$\sum_f [C_f(t) (g_f - S_f(t))] \tag{17}$$
$$= \sum_f \Big[ C_f(t) \Big( \sum_k \beta_k S^k_f - S_f(t) \Big) \Big] \tag{18}$$
$$\le \sum_k \beta_k \sum_f C_f(t) S^k_f - \tfrac{1}{2} \sum_f C_f(t) S^*_f(t) \tag{19}$$
$$= \sum_k \beta_k \sum_f C_f(t) S^k_f - \beta_{sum} \sum_f C_f(t) S^*_f(t) - \left( \tfrac{1}{2} - \beta_{sum} \right) \sum_f C_f(t) S^*_f(t) \tag{20}$$
$$= \sum_k \beta_k \Big[ \sum_f C_f(t) S^k_f - \sum_f C_f(t) S^*_f(t) \Big] - \left( \tfrac{1}{2} - \beta_{sum} \right) \sum_f C_f(t) S^*_f(t) \tag{21}$$

Consider the last equation. In the $\sum_k$ summation, the term
$\sum_f C_f(t) S^*_f(t)$ is the weight of a maximum weighted
matching. It is larger than or equal to each $\sum_f C_f(t) S^k_f$
term, which is the weight of some fixed matching. This, together
with the fact that each $\beta_k > 0$, implies all the $\sum_k$
terms are $\le 0$. In the second term, denote
$(\tfrac{1}{2} - \beta_{sum})$ by $\gamma$. Note that $\gamma > 0$
because $\beta_{sum} \le \alpha < \tfrac{1}{2}$. We have:

$$\sum_f [C_f(t) (g_f - S_f(t))] \le 0 - \gamma \sum_f C_f(t) S^*_f(t) \tag{22}$$
$$= -\gamma \times \text{weight of maximum weighted matching} \tag{23}$$

Substituting this back into equation (16), we have

$$V(t+1) - V(t) \le 2 \sum_f [C_f(t) (g_f - S_f(t))] + K_1 \tag{24}$$
$$\le K_1 - 2\gamma \times \text{weight of maximum weighted matching} \tag{25}$$

where the term $\sum_f (g_f - S_f(t))^2$ has been bounded by some
constant $K_1$ in the last inequality. This is possible because
both $g_f$ (given constants) and $S_f(t)$ (either 0 or 1) are
bounded.

We can finally prove that $V(t) = \sum_f C_f(t)^2$ is bounded. The
logic has two parts: First, since $V(t+1) - V(t) \le K_1$, the V(t)
value can only increase by a finite amount each timeslot.
Second, the largest edge weight at time t is at least
$\sqrt{V(t) / \text{number of VCs}}$, and a maximum weighted
matching weighs at least as much as the largest edge weight.
Therefore, for large enough V(t), the maximum weighted
matching will weigh more than the constant $K_1 / 2\gamma$, so that
$V(t+1) - V(t) < 0$, i.e., V will decrease in the next timeslot.
Thus the value of V can increase by at most $K_1$ each
timeslot, until it becomes too large and must decrease.
Therefore V(t) is bounded for all time (see note 11). So, each
$C_f(t)$ is also bounded for all time.

Note 11: Specifically, if
$V(t) > V_{critical} = \text{no. of VCs} \times (K_1 / 2\gamma)^2$,
then V must decrease in the next timeslot. Therefore the bound
is $V(t) \le V_{critical} + K_1$ for all time.

If integral edge weights $\lfloor C_f(t) \rfloor$ are used, each
edge weight has changed at most by 1 and so the weight of any
matching ($S$, $S^*$, $S^k$) has changed by at most N. Such a
bounded change can be absorbed by replacing the 0 in equation
(22) by a constant $K_2$, similar to $K_1$ of equation (16).

Theorem 4 (Bucket-credit-weighted algorithm supports 50%
reservations for any traffic pattern)

If $\alpha < \tfrac{1}{2}$, and each VC f has a finite bucket size
$B_f$, then the bucket-credit-weighted algorithm guarantees that
there is a bound $C_{max}$ such that, for any arbitrary traffic
arrival pattern, at any time t, for any VC f, the VC's credit
$C_f(t) \le C_{max}$. This result holds whether the edge weights
are fractional credits or an integral approximation, i.e.,
$\lfloor C_f(t) \rfloor$.

Proof: We will briefly outline how the previous proof can
be adapted. Define $G_f(t)$ to be a VC's credit increment at
timeslot t, i.e., $G_f(t) = 0$ (missed credit increment) if the VC
is idle and its $C_f(t) > B_f$; otherwise $G_f(t) = g_f$ (normal
credit increment). Then equation (13) becomes instead

$$C_f(t+1) = C_f(t) + G_f(t) - S_f(t) \tag{26}$$

and equations (14)-(16) still hold after replacing $g_f$ with
$G_f(t)$.

Let $F_b(t)$ be the set of VCs that are busy (non-empty
queues) at time t, and let $F_i(t)$ be the set of idle VCs. Since
the algorithm ignores idle VCs, their $C_f(t)$ do not contribute
to a matching's weight, i.e., they are not among the
edge weights actually used by the CQ algorithm. The crucial
observation is this: the weight of matching $\{S_f\}$ is given by
$\sum_{f \in F_b(t)} C_f(t) S_f(t)$, where the summation only
includes busy VCs, not idle VCs. Based on this observation, we
can rewrite the left hand side of equation (18) as

$$\sum_f [C_f(t) (G_f(t) - S_f(t))] \tag{27}$$
$$= \sum_{f \in F_b(t)} [C_f(t) (G_f(t) - S_f(t))] + \sum_{f \in F_i(t)} [C_f(t) (G_f(t) - S_f(t))] \tag{28}$$
$$\le \sum_{f \in F_b(t)} [C_f(t) (G_f(t) - S_f(t))] + (K_2 - 0) \tag{29}$$

where the term $\sum_{f \in F_i(t)} [C_f(t) S_f(t)] = 0$ (no service
possible, i.e., $S_f(t) = 0$, for idle VCs), and the term
$\sum_{f \in F_i(t)} [C_f(t) G_f(t)]$ has been bounded by some
positive constant $K_2$, because idle VCs either have bounded
$C_f(t)$ (bucket size restriction) or $G_f(t) = 0$ (no credit
increment). The remaining term
$\sum_{f \in F_b(t)} [C_f(t) G_f(t)]$ can now be treated just like
$\sum_f [C_f(t) g_f]$ of equations (18)-(23). In particular, at any
time t, the set $\{G_f(t)\}_{f \in F_b(t)}$ can still be written as
a convex combination of some matchings in the style of lemma 1
(since $G_f(t) \le g_f$). Thus equations (23) and (25) simply
become

$$\sum_f [C_f(t) (G_f(t) - S_f(t))] \le K_2 - \gamma \times \text{weight of maximum weighted matching} \tag{30}$$
$$V(t+1) - V(t) \le K_1 + 2 K_2 - 2\gamma \times \text{weight of maximum weighted matching} \tag{31}$$

and the rest of the proof follows without change.

We claim:

1. A method for scheduling transmission of cells through
a data switch having a plurality of inputs and outputs,
comprising the steps of:

providing, at each input, a plurality of buffers corresponding
to the outputs, said buffers temporarily holding cells;

within each timeslot, assigning credits to each buffer
according to a guaranteed bandwidth for that buffer;

assigning a weight to each buffer;

setting the weight associated with each buffer based on an
accumulated number of credits associated with the
buffer;

selecting buffers according to a weighted matching of
inputs and outputs wherein each unselected buffer
shares an input or an output with a selected buffer
whose weight is greater or equal to the unselected
buffer's weight;

transmitting cells from the selected buffers to the
corresponding outputs.

2. The method of claim 1 wherein the matching is a
maximal weighted matching.

3. The method of claim 1 wherein the steps of assigning
weights, selecting buffers and transmitting cells are repeated
for consecutive timeslots.

4. The method of claim 1 wherein credits are assigned in
integral units including zero.

5. The method of claim 1 wherein the weight associated
with a buffer is zero if the buffer is empty, regardless of
actual credit.

6. The method of claim 1 wherein a credit bucket size is
assigned to each buffer, such that if a buffer is empty and has
a number of credits exceeding its associated credit bucket
size, the buffer receives no further credits.

7. The method of claim 1 further comprising the step of
setting each weight associated with a buffer to either the
buffer's length, or to the number of credits associated with
the buffer.

8. The method of claim 1 further comprising the step of
setting each weight associated with a buffer to either the
buffer's length, or to the number of credits associated with
the buffer, whichever is less.

9. The method of claim 8 further comprising:

maintaining an age for each cell; and

if the age for some cell exceeds a predefined threshold for
the corresponding buffer, employing an exception
mechanism to decide whether to select the buffer.

10. The method of claim 8 further comprising the step of
flushing out cells during long idle periods.

11. The method of claim 1 further comprising the step of
setting each weight associated with a buffer to a validated
waiting time associated with an oldest cell in the buffer.
12. The method of claim 11 further comprising the steps of:

validating a cell when there is a credit available; 
recording the time of validation for each cell; and 
calculating the validated waiting time for each cell based 
on the current time and the validation time. 

13. The method of claim 11 wherein the validated waiting 
time of an oldest cell is calculated as a minimum of actual 
waiting time of the oldest cell and an age of an oldest credit
associated with the buffer. 

14. The method of claim 11 wherein the validated waiting 
time of an oldest cell is estimated, the estimation being 
based on actual waiting time of the oldest cell, a number of 
credits associated with the buffer, and the rate at which
credits are accrued. 

15. The method of claim 11 wherein each validated 
waiting time associated with a buffer is scaled by a constant 
which is inversely proportional to a predetermined tolerable 
delay.

16. The method of claim 15 wherein the predetermined 
tolerable delay is the inverse of the guaranteed bandwidth 
associated with the buffer. 

17. The method of claim 1 wherein the data switch is a 
crossbar switch.

18. A method for scheduling transmission of cells through 
a data switch having a plurality of inputs and outputs, 
comprising the steps of: 

providing, at each input, a plurality of buffers corresponding
to the outputs, said buffers temporarily holding
cells;

assigning a weight to each buffer;

selecting buffers according to a weighted matching of 
inputs and outputs wherein each unselected buffer 
shares an input or an output with a selected buffer
whose weight is greater or equal to the unselected 
buffer's weight; and 

transmitting cells from the selected buffers to the
corresponding outputs;

at each timeslot, computing a matching and a correspond- 
ing total edge weight; 

comparing a total edge weight of a current matching with 
an immediately preceding matching; and 

selecting the matching with a larger corresponding edge
weight. 

19. A method for scheduling transmission of cells through 
a data switch having a plurality of inputs and outputs, 
comprising the steps of: 

providing, at each input, a plurality of buffers corresponding
to the outputs, said buffers temporarily holding
cells; 

assigning a weight to each buffer; 

selecting buffers according to a weighted matching of 
inputs and outputs as determined by a stable marriage
algorithm, wherein each unselected buffer shares an 
input or an output with a selected buffer whose weight 
is greater or equal to the unselected buffer's weight; 

transmitting cells from the selected buffers to the
corresponding outputs;

providing fairness in allocating leftover bandwidth by 
determining a second matching between remaining 
inputs and outputs; 

selecting buffers according to the second matching; and

transmitting cells from the selected buffers to the corre- 
sponding outputs. 



20. The method of claim 19, wherein max-min fairness is 
provided. 

21. The method of claim 19, wherein, during a second 
phase of weight assignments, additional paths are chosen 
based on usage weights. 

22. The method of claim 19, wherein allocating leftover 
bandwidth fairly comprises assigning weights based on 
usage and credits. 

23. A method for scheduling transmission of cells through 
a data switch having a plurality of inputs and outputs, 
comprising the steps of: 

providing, at each input, a plurality of buffers correspond- 
ing to the outputs, said buffers temporarily holding 
cells; 

assigning a weight to each buffer; 

selecting buffers according to a weighted matching of 
inputs and outputs wherein each unselected buffer 
shares an input or an output with a selected buffer 
whose weight is greater or equal to the unselected 
buffer's weight; 

transmitting cells from the selected buffers to the corre- 
sponding outputs; 

providing, at each input, a buffer for each virtual 
connection, wherein several virtual connections share 
the same input-output pair, each virtual connection 
having its own guaranteed rate; 

for each input/output pair, determining which virtual 
connection within the input/output pair has a maximum 
weight; 

assigning the respective maximum weight to the corre- 
sponding input/output pair; 

selecting input/output pairs based on the assigned 
weights, and according to a maximal weighted match- 
ing; and 

transmitting cells from the selected inputs to the corre- 
sponding outputs. 

24. A method for scheduling transmission of cells through 
a data switch having a plurality of inputs and outputs, 
comprising the steps of: 

providing, at each input, a plurality of buffers correspond- 
ing to the outputs, said buffers temporarily holding 
cells; 

assigning a weight to each buffer; 

selecting buffers according to a weighted matching of 
inputs and outputs wherein each unselected buffer 
shares an input or an output with a selected buffer 
whose weight is greater or equal to the unselected 
buffer's weight, the matching being a maximal 
weighted matching determined by using a stable mar- 
riage algorithm; and 

transmitting cells from the selected buffers to the corre- 
sponding outputs. 

25. A method for scheduling transmission of cells through 
a data switch having a plurality of inputs and outputs, 
comprising the steps of: 

providing, at each input, a plurality of buffers correspond- 
ing to the outputs, said buffers temporarily holding 
cells; 

assigning a weight to each buffer; 

selecting buffers according to a weighted matching of 
inputs and outputs, as determined by a stable marriage
algorithm, each unselected buffer sharing an input or an
output with a selected buffer whose weight is greater or 
equal to the unselected buffer's weight, and buffers 
having a greatest weight being selected first, followed
by buffers having a next greatest weight, and so on, 
until buffers having a least positive weight are 
assigned; and 

transmitting cells from the selected buffers to the
corresponding outputs.

26. A method for scheduling transmission of cells through 
a data switch having a plurality of inputs and outputs, 
comprising the steps of: 

providing, at each input, a plurality of buffers corresponding
to the outputs, said buffers temporarily holding
cells; 

assigning a weight to each buffer; 

selecting buffers according to a weighted matching of
inputs and outputs wherein each unselected buffer 
shares an input or an output with a selected buffer 
whose weight is greater or equal to the unselected 
buffer's weight; 

providing a data structure of linked lists, each list being
associated with a weight, each list holding references to
buffers having said associated weight, and each list
having links to next and previous lists associated
respectively with weights one greater and one less than
the subject list's associated weight;

placing each buffer reference in a list associated with the 
weight of the buffer; 

upon incrementing a buffer's weight by one, moving its 
reference from its current list to the next list, and upon 
decrementing a buffer's weight by one, moving its
reference from its current list to the previous list; and 

for each list, in order of descending associated weight, 
selecting buffers which do not share input or output 
nodes with buffers which have already been selected;
and 

transmitting cells from the selected buffers to the corre- 
sponding outputs. 

27. A method for scheduling transmission of cells through
a data switch having a plurality of inputs and outputs,
comprising the steps of:

providing, at each input, a plurality of buffers correspond- 
ing to the outputs, said buffers temporarily holding 
cells; and 

within each timeslot,
assigning credits to each buffer according to a guaran- 
teed bandwidth for that buffer, 
assigning a weight to each buffer as a function of 
credits, 

determining a matching of inputs and outputs based on
the assigned buffer weights and selecting buffers 
according to the matching, and 

transmitting a cell from each of the selected buffers to 
their corresponding outputs and removing a credit 
from each of the selected buffers.

28. The method of claim 27 wherein assigning weight to 
each buffer further comprises setting the weight to an 
accumulated number of credits associated with the buffer. 

29. The method of claim 28 wherein the weight associated 
with a buffer is zero if the buffer is empty, regardless of
actual credit. 

30. The method of claim 28 wherein a credit bucket size 
is assigned to each buffer, such that if a buffer is empty and 
has a number of credits exceeding its associated credit 
bucket size, the buffer receives no further credits.

31. The method of claim 28 further comprising the step of 
setting each weight associated with a buffer to either the 
buffer's length, or to the number of credits associated with
the buffer, whichever is less. 

32. The method of claim 31 further comprising: 
maintaining an age for each cell; and 
if the age for some cell exceeds a predefined threshold for
the corresponding buffer, employing an exception
mechanism to decide whether to select the buffer.

33. The method of claim 31 further comprising the step of 
flushing out cells during long idle periods. 

34. The method of claim 31 further comprising the steps 
of: 

validating a cell when there is a credit available; 
recording a validation time; and 

assigning a weight associated with a buffer equal to a 
validated waiting time associated with an oldest cell in 
the buffer. 

35. The method of claim 28 wherein credits are assigned 
in integral units including zero. 

36. The method of claim 27 wherein the matching is 
determined by using a stable marriage algorithm. 

37. The method of claim 27 wherein buffers having a 
greatest weight are selected first, followed by buffers having 
a next greatest weight, and so on, until buffers having a least 
positive weight are assigned. 

38. The method of claim 27, further comprising the steps 
of: 

providing a data structure of linked lists, each list being 
associated with a weight, each list holding references to 
buffers having said associated weight, and each list 
having links to next and previous lists associated 
respectively with weights one greater and one less than 
the subject list's associated weight; 
placing each buffer reference in a list associated with the
weight of the buffer;
upon incrementing a buffer's weight by one, moving its
reference from its current list to the next list, and upon
decrementing a buffer's weight by one, moving its
reference from its current list to the previous list; and
for each list, in order of descending associated weight, 
selecting buffers which do not share input or output 
nodes with buffers which have already been selected. 

39. A method for scheduling transmission of cells through 
a data switch having a plurality of inputs and outputs, 
comprising the steps of: 

providing, at each input, a plurality of buffers correspond- 
ing to the outputs, said buffers temporarily holding 
cells; 

assigning a weight to each buffer; 
providing a data structure of linked lists, each list being 
associated with a weight, each list holding references to 
buffers having said associated weight, and each list 
having links to next and previous lists associated 
respectively with weights one greater and one less than 
the subject list's associated weight; 
placing each buffer reference in a list associated with the
weight of the buffer;
upon incrementing a buffer's weight by one, moving its
reference from its current list to the next list, and upon
decrementing a buffer's weight by one, moving its
reference from its current list to the previous list;
for each list, in order of descending associated weight, 
selecting buffers which do not share input or output 
nodes with buffers which have already been selected; 
and 
transmitting cells from the selected buffers to the corre-
sponding outputs. 

40. A data switch, comprising: 

a plurality of inputs and outputs; 

at each input, a plurality of buffers corresponding to the
outputs, said buffers temporarily holding cells, wherein 
a weight is assigned to each buffer, said buffers being 
selected according to a weighted matching of inputs 
and outputs as determined by a stable marriage 
algorithm, such that each unselected buffer shares an
input or an output with a selected buffer whose weight 
is greater or equal to the unselected buffer's weight, 
within each of a plurality of timeslots, credits being 
assigned to each buffer according to a guaranteed 
bandwidth for that buffer such that the weight associated
with each buffer is based on an accumulated
number of credits associated with the buffer; and 

a switch fabric through which cells from the selected 
buffers are transmitted to the corresponding outputs.

41. The data switch of claim 40 wherein the weight 
associated with a buffer is zero if the buffer is empty, 
regardless of actual credit. 

42. The data switch of claim 40 wherein a credit bucket 
size is assigned to each buffer, such that if a buffer is empty
and has a number of credits exceeding its associated credit 
bucket size, the buffer receives no further credits. 

43. The data switch of claim 40 wherein each weight 
associated with a buffer is set to either the buffer's length, or
to the number of credits associated with the buffer.

44. The data switch of claim 40 wherein each weight 
associated with a buffer is set to either the buffer's length, or 
to the number of credits associated with the buffer, which- 
ever is less. 

45. The data switch of claim 40 wherein each weight
associated with a buffer is set to a validated waiting time 
associated with an oldest cell in the buffer. 

46. The data switch of claim 45 wherein the validated 
waiting time for each cell is based on current time and the 
cell's validation time, the cell's validation time being when
the cell is validated by an available credit. 

47. The data switch of claim 45 wherein the validated 
waiting time of an oldest cell is calculated as a minimum of 
actual waiting time of the oldest cell and an age of an oldest 
credit associated with the buffer.

48. The data switch of claim 45 wherein the validated 
waiting time of an oldest cell is estimated, the estimation 
being based on actual waiting time of the oldest cell, a 
number of credits associated with the buffer, and the rate at 
which credits are accrued.

49. The data switch of claim 45 wherein each weight 
associated with a buffer is scaled by the guaranteed
bandwidth associated with the buffer.

50. A data switch, comprising: 

a plurality of inputs and outputs;
at each input, a plurality of buffers corresponding to the 
outputs, said buffers temporarily holding cells, wherein 
a weight is assigned to each buffer, said buffers being 
selected according to a weighted matching of inputs 
and outputs as determined by a stable marriage
algorithm, such that each unselected buffer shares an 
input or an output with a selected buffer whose weight 
is greater or equal to the unselected buffer's weight; and 
a switch fabric through which cells from the selected 
buffers are transmitted to the corresponding outputs, a
second matching between remaining inputs and outputs 
being determined, buffers being selected according to 
the second matching and cells being transmitted from
the selected buffers to the corresponding outputs. 

51. The data switch of claim 50, wherein the second 
matching is determined according to a max-min fairness 
algorithm. 

52. The data switch of claim 50, wherein the second 
matching is based on usage weights during a second phase 
of weight assignments. 

53. The data switch of claim 50, wherein weights are 
based on usage and credits. 

54. A data switch, comprising: 

a plurality of inputs and outputs; 

at each input, a plurality of buffers corresponding to the 
outputs, said buffers temporarily holding cells, wherein 
a weight is assigned to each buffer, said buffers being 
selected according to a weighted matching of inputs 
and outputs such that each unselected buffer shares an 
input or an output with a selected buffer whose weight 
is greater or equal to the unselected buffer's weight; 

a switch fabric through which cells from the selected 
buffers are transmitted to the corresponding outputs; 
and 

wherein several virtual connections share the same input- 
output pair, each virtual connection having its own 
guaranteed rate, the data switch further comprising, at 
each input, a buffer for each virtual connection, such 
that 

for each input/output pair, a virtual connection within 
the input/output pair, having a maximum weight, is 
identified, 

the maximum weight corresponding to the identified
virtual connection, is assigned to the respective
input/output pair,
input/output pairs are selected based on the assigned
weights, and according to a maximal weighted
matching, and
cells are transmitted from the selected inputs to the
corresponding outputs.

55. A data switch, comprising: 

a plurality of inputs and outputs; 
at each input, a plurality of buffers corresponding to the 
outputs, said buffers temporarily holding cells, such 
that within each of a plurality of timeslots, 
credits are assigned to each buffer according to a
guaranteed bandwidth for that buffer,
a weight is assigned to each buffer as a function of
credits,
a matching of inputs and outputs, based on the assigned
buffer weights, is determined and buffers are selected
according to the matching, and
a cell is transmitted from each of the selected buffers to
their corresponding outputs and a credit is removed
from each of the selected buffers.

56. A data switch, comprising: 

a plurality of inputs and outputs; 

at each input, a plurality of buffers corresponding to the 
outputs, said buffers temporarily holding cells, wherein 
a weight is assigned to each buffer; 

a data structure of linked lists, each list being associated 
with a weight, each list holding references to buffers 
having said associated weight, and each list having 
links to next and previous lists associated respectively 
with weights one greater and one less than the subject 
list's associated weight, wherein 
each buffer reference is placed in a list associated with 
the weight of the buffer, 
upon incrementing a buffer's weight by one, its reference
is moved from its current list to the next list, and
upon decrementing a buffer's weight by one, its
reference is moved from its current list to the
previous list, and
for each list, in order of descending associated weight, 
buffers are selected which do not share input or 
output nodes with buffers which have already been 
selected. 
57. A data switch, comprising: 
a plurality of inputs and outputs; 
at each input, a plurality of buffer means for temporarily 
holding cells, each buffer means corresponding to the 
outputs; 

credit assigning means for assigning credits to each buffer 
means; 

weight assigning means for assigning a weight to each 
buffer means as a function of credits; 

matching means for determining a matching of inputs and 
outputs, based on the assigned weights, such that buffer 
means are selected according to the matching, and 

transmission means through which cells from the selected 
buffer means are transmitted to the corresponding 
outputs, wherein a credit is removed from each of the 
selected buffers. 



58. A data switch, comprising:

a plurality of inputs and outputs; 

at each input, a plurality of buffer means for temporarily
holding cells, said buffer means corresponding to the
outputs;

weight assigning means for assigning a weight to each 
buffer means; 

list means, each list means being associated with a weight, 
for holding references to buffer means having said 
associated weight, and each list means having link 
means to next and previous list means associated 
respectively with weights one greater and one less than 
the subject list means' associated weight, wherein 
each buffer means reference is placed in a list means
associated with the weight of the buffer means,
upon incrementing a buffer means' weight by one, its
reference is moved from its current list means to the
next list means, and upon decrementing a buffer
means' weight by one, its reference is moved from
its current list means to the previous list means, and
for each list means, in order of descending associated 
weight, buffer means are selected which do not share 
input or output nodes with buffer means which have 
already been selected. 